Lotus: Diffusion-based visual foundation model for high-quality dense prediction

J He, H Li, W Yin, Y Liang, L Li, K Zhou… - arXiv preprint arXiv …, 2024 - arxiv.org
Leveraging the visual priors of pre-trained text-to-image diffusion models offers a promising
solution to enhance zero-shot generalization in dense prediction tasks. However, existing …

TAPTRv2: Attention-based position update improves tracking any point

H Li, H Zhang, S Liu, Z Zeng, F Li, T Ren, B Li… - arXiv preprint arXiv …, 2024 - arxiv.org
In this paper, we present TAPTRv2, a Transformer-based approach built upon TAPTR for
solving the Tracking Any Point (TAP) task. TAPTR borrows designs from DEtection …
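
The snippet breaks off at the reference to DETR (DEtection TRansformer), whose query-refinement pattern the title's "attention-based position update" echoes. The sketch below is only a generic illustration of that pattern, not TAPTRv2's actual module; the class name, offset scaling, and dimensions are assumptions for illustration.

```python
import torch
import torch.nn as nn

class AttnPositionUpdate(nn.Module):
    """Generic DETR-style refinement of a tracked point's location (illustrative only)."""
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.to_delta = nn.Linear(dim, 2)  # predict a (dx, dy) offset per point

    def forward(self, query, pos, frame_tokens):
        # query:        (B, N, dim) content features of the N tracked points
        # pos:          (B, N, 2)   current normalized point locations in [0, 1]
        # frame_tokens: (B, HW, dim) flattened per-frame image features
        attended, _ = self.cross_attn(query, frame_tokens, frame_tokens)
        query = query + attended                       # refresh point content via attention
        pos = pos + 0.1 * self.to_delta(query).tanh()  # attention-driven position update
        return query, pos.clamp(0.0, 1.0)
```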

Robo-GS: A physics consistent spatial-temporal model for robotic arm with hybrid representation

H Lou, Y Liu, Y Pan, Y Geng, J Chen, W Ma… - arXiv preprint arXiv …, 2024 - arxiv.org
Real2Sim2Real plays a critical role in robotic arm control and reinforcement learning, yet
bridging this gap remains a significant challenge due to the complex physical properties of …

Align3R: Aligned Monocular Depth Estimation for Dynamic Videos

J Lu, T Huang, P Li, Z Dou, C Lin, Z Cui, Z Dong… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent developments in monocular depth estimation methods enable high-quality depth
estimation of single-view images but fail to estimate consistent video depth across different …

ObjCtrl-2.5D: Training-free Object Control with Camera Poses

Z Wang, Y Lan, S Zhou, CC Loy - arXiv preprint arXiv:2412.07721, 2024 - arxiv.org
This study aims to achieve more precise and versatile object control in image-to-video (I2V)
generation. Current methods typically represent the spatial movement of target objects with …

Track4Gen: Teaching Video Diffusion Models to Track Points Improves Video Generation

H Jeong, CHP Huang, JC Ye, N Mitra… - arXiv preprint arXiv …, 2024 - arxiv.org
While recent foundational video generators produce visually rich output, they still struggle
with appearance drift, where objects gradually degrade or change inconsistently across …

ProTracker: Probabilistic Integration for Robust and Accurate Point Tracking

T Zhang, C Wang, Z Dou, Q Gao, J Lei, B Chen… - arXiv preprint arXiv …, 2025 - arxiv.org
In this paper, we propose ProTracker, a novel framework for robust and accurate long-term
dense tracking of arbitrary points in videos. The key idea of our method is incorporating …

Trajectory-aligned Space-time Tokens for Few-shot Action Recognition

P Kumar, N Padmanabhan, L Luo… - … on Computer Vision, 2024 - Springer
We propose a simple yet effective approach for few-shot action recognition, emphasizing the
disentanglement of motion and appearance representations. By harnessing recent progress …

Towards Robust Automation of Surgical Systems via Digital Twin-based Scene Representations from Foundation Models

H Ding, L Seenivasan, H Shu, G Byrd, H Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language model (LLM)-based agents are emerging as a powerful enabler of robust
embodied intelligence due to their capability of planning complex action sequences. Sound …

Hybrid Cost Volume for Memory-Efficient Optical Flow

Y Zhao, G Xu, G Wu - Proceedings of the 32nd ACM International …, 2024 - dl.acm.org
Current state-of-the-art flow methods are mostly based on dense all-pairs cost volumes.
However, as image resolution increases, the computational and spatial complexity of …
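
The truncated sentence is pointing at the quadratic cost of all-pairs matching. As a rough illustration of why resolution is the bottleneck, here is a minimal sketch of a RAFT-style all-pairs correlation volume, not the paper's memory-efficient hybrid design; the tensor names and the 1/8-resolution figures are assumptions.

```python
# Hypothetical illustration: a dense all-pairs cost volume. For feature maps
# with H*W positions, the volume stores one similarity per pixel pair, so its
# memory footprint grows quadratically in H*W.
import torch

def all_pairs_cost_volume(feat1: torch.Tensor, feat2: torch.Tensor) -> torch.Tensor:
    """feat1, feat2: (B, C, H, W) feature maps of the two frames.
    Returns (B, H*W, H*W) dot-product similarities."""
    b, c, h, w = feat1.shape
    f1 = feat1.flatten(2)                        # (B, C, H*W)
    f2 = feat2.flatten(2)                        # (B, C, H*W)
    corr = torch.einsum('bci,bcj->bij', f1, f2)  # (B, H*W, H*W)
    return corr / c ** 0.5

# Rough scale of the problem: 1/8-resolution features of a 1080p frame give
# 240 * 135 = 32,400 positions, so the volume alone holds 32,400^2 ≈ 1.05e9
# entries ≈ 4.2 GB in fp32, before any correlation pyramid levels are added.
```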