Align3R: Aligned Monocular Depth Estimation for Dynamic Videos

J Lu, T Huang, P Li, Z Dou, C Lin, Z Cui, Z Dong… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent developments in monocular depth estimation methods enable high-quality depth
estimation of single-view images but fail to estimate consistent video depth across different …

InfiniteWorld: A Unified Scalable Simulation Framework for General Visual-Language Robot Interaction

P Ren, M Li, Z Luo, X Song, Z Chen, W Liufu… - arXiv preprint arXiv …, 2024 - arxiv.org
Realizing scaling laws in embodied AI has become a focus. However, previous work has
been scattered across diverse simulation platforms, with assets and models lacking unified …

Survey on Monocular Metric Depth Estimation

J Zhang - arXiv preprint arXiv:2501.11841, 2025 - arxiv.org
Monocular Depth Estimation (MDE) is a fundamental computer vision task underpinning
applications such as spatial understanding, 3D reconstruction, and autonomous driving …

Stereo4D: Learning How Things Move in 3D from Internet Stereo Videos

L Jin, R Tucker, Z Li, D Fouhey, N Snavely… - arXiv preprint arXiv …, 2024 - arxiv.org
Learning to understand dynamic 3D scenes from imagery is crucial for applications ranging
from robotics to scene reconstruction. Yet, unlike other problems where large-scale …

Local Policies Enable Zero-shot Long-horizon Manipulation

M Dalal, M Liu, W Talbott, C Chen, D Pathak… - arXiv preprint arXiv …, 2024 - arxiv.org
Sim2real for robotic manipulation is difficult due to the challenges of simulating complex
contacts and generating realistic task distributions. To tackle the latter problem, we introduce …

VistaDream: Sampling multiview consistent images for single-view scene reconstruction

H Wang, Y Liu, Z Liu, W Wang, Z Dong… - arXiv preprint arXiv …, 2024 - arxiv.org
In this paper, we propose VistaDream, a novel framework to reconstruct a 3D scene from a
single-view image. Recent diffusion models enable generating high-quality novel-view …

DAViD: Modeling Dynamic Affordance of 3D Objects using Pre-trained Video Diffusion Models

H Kim, S Beak, H Joo - arXiv preprint arXiv:2501.08333, 2025 - arxiv.org
Understanding the ability of humans to use objects is crucial for AI to improve daily life.
Existing studies for learning such ability focus on human-object patterns (e.g., contact, spatial …

FoundationStereo: Zero-Shot Stereo Matching

B Wen, M Trepte, J Aribido, J Kautz, O Gallo… - arXiv preprint arXiv …, 2025 - arxiv.org
Tremendous progress has been made in deep stereo matching to excel on benchmark
datasets through per-domain fine-tuning. However, achieving strong zero-shot …

MultiDepth: Multi-Sample Priors for Refining Monocular Metric Depth Estimations in Indoor Scenes

S Byun, J Song, WS Chung - arXiv preprint arXiv:2411.01048, 2024 - arxiv.org
Monocular metric depth estimation (MMDE) is a crucial task to solve for indoor scene
reconstruction on edge devices. Despite this importance, existing models are sensitive to …

Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control

Z Gu, R Yan, J Lu, P Li, Z Dou, C Si, Z Dong… - arXiv preprint arXiv …, 2025 - arxiv.org
Diffusion models have demonstrated impressive performance in generating high-quality
videos from text prompts or images. However, precise control over the video generation …