Large Spatial Model: End-to-end unposed images to semantic 3D

Z Fan, J Zhang, W Cong, P Wang, R Li… - Advances in …, 2025 - proceedings.neurips.cc
Reconstructing and understanding 3D structures from a limited number of images is a
classical problem in computer vision. Traditional approaches typically decompose this task …

CoTracker3: Simpler and better point tracking by pseudo-labelling real videos

N Karaev, I Makarov, J Wang, N Neverova… - arXiv preprint arXiv …, 2024 - arxiv.org
Most state-of-the-art point trackers are trained on synthetic data due to the difficulty of
annotating real videos for this task. However, this can result in suboptimal performance due …

MVSplat360: Feed-forward 360° scene synthesis from sparse views

Y Chen, C Zheng, H Xu, B Zhuang, A Vedaldi… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce MVSplat360, a feed-forward approach for 360° novel view synthesis
(NVS) of diverse real-world scenes, using only sparse observations. This setting is …

AnimateAnything: Consistent and controllable animation for video generation

G Lei, C Wang, H Li, R Zhang, Y Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
We present a unified controllable video generation approach AnimateAnything that
facilitates precise and consistent video manipulation across various conditions, including …

Can Visual Foundation Models Achieve Long-term Point Tracking?

G Aydemir, W Xie, F Güney - arXiv preprint arXiv:2408.13575, 2024 - arxiv.org
Large-scale vision foundation models have demonstrated remarkable success across
various tasks, underscoring their robust generalization capabilities. While their proficiency in …

MegaSaM: Accurate, fast, and robust structure and motion from casual dynamic videos

Z Li, R Tucker, F Cole, Q Wang, L Jin, V Ye… - arXiv preprint arXiv …, 2024 - arxiv.org
We present a system that allows for accurate, fast, and robust estimation of camera
parameters and depth maps from casual monocular videos of dynamic scenes. Most …

UniHOI: Learning Fast, Dense and Generalizable 4D Reconstruction for Egocentric Hand Object Interaction Videos

C Yuan, G Chen, L Yi, Y Gao - arXiv preprint arXiv:2411.09145, 2024 - arxiv.org
Egocentric Hand Object Interaction (HOI) videos provide valuable insights into human
interactions with the physical world, attracting growing interest from the computer vision and …

Continuous 3D Perception Model with Persistent State

Q Wang, Y Zhang, A Holynski, AA Efros… - arXiv preprint arXiv …, 2025 - arxiv.org
We present a unified framework capable of solving a broad range of 3D tasks. Our approach
features a stateful recurrent model that continuously updates its state representation with …

GeoRecon: a coarse-to-fine visual 3D reconstruction approach for high-resolution images with neural matching priors

W Bei, X Fan, H Jian, X Du, D Yan, J Xu… - International Journal of …, 2024 - Taylor & Francis
Visual 3D reconstruction enables rebuilding 3D scenes from captured images, serving as a
fundamental data source for digital earth modeling and intelligent cities. In the foundational …

MegaSynth: Scaling Up 3D Scene Reconstruction with Synthesized Data

H Jiang, Z Xu, D Xie, Z Chen, H Jin, F Luan… - arXiv preprint arXiv …, 2024 - arxiv.org
We propose scaling up 3D scene reconstruction by training with synthesized data. At the
core of our work is MegaSynth, a procedurally generated 3D dataset comprising 700K …