Segment3D: Learning fine-grained class-agnostic 3D segmentation without manual labels

R Huang, S Peng, A Takmaz, F Tombari… - … on Computer Vision, 2024 - Springer
Current 3D scene segmentation methods are heavily dependent on manually annotated 3D
training datasets. Such manual annotations are labor-intensive, and often lack fine-grained …

P2P-Bridge: Diffusion Bridges for 3D Point Cloud Denoising

M Vogel, K Tateno, M Pollefeys, F Tombari… - … on Computer Vision, 2024 - Springer
In this work, we address the task of point cloud denoising using a novel framework adapting
Diffusion Schrödinger bridges to unstructured data like point sets. Unlike previous works that …

Track4Gen: Teaching Video Diffusion Models to Track Points Improves Video Generation

H Jeong, CHP Huang, JC Ye, N Mitra… - arXiv preprint arXiv …, 2024 - arxiv.org
While recent foundational video generators produce visually rich output, they still struggle
with appearance drift, where objects gradually degrade or change inconsistently across …

DeSiRe-GS: 4D street Gaussians for static-dynamic decomposition and surface reconstruction for urban driving scenes

C Peng, C Zhang, Y Wang, C Xu, Y Xie… - arXiv preprint arXiv …, 2024 - arxiv.org
We present DeSiRe-GS, a self-supervised Gaussian splatting representation, enabling
effective static-dynamic decomposition and high-fidelity surface reconstruction in complex …

Feat2GS: Probing Visual Foundation Models with Gaussian Splatting

Y Chen, X Chen, A Chen, G Pons-Moll… - arXiv preprint arXiv …, 2024 - arxiv.org
Given that visual foundation models (VFMs) are trained on extensive datasets but often
limited to 2D images, a natural question arises: how well do they understand the 3D world …

DepthCues: Evaluating Monocular Depth Perception in Large Vision Models

D Danier, M Aygün, C Li, H Bilen… - arXiv preprint arXiv …, 2024 - arxiv.org
Large-scale pre-trained vision models are becoming increasingly prevalent, offering
expressive and generalizable visual representations that benefit various downstream tasks …

Multiview Equivariance Improves 3D Correspondence Understanding with Minimal Feature Finetuning

Y You, Y Li, C Deng, Y Wang, L Guibas - arXiv preprint arXiv:2411.19458, 2024 - arxiv.org
Vision foundation models, particularly the ViT family, have revolutionized image
understanding by providing rich semantic features. However, despite their success in 2D …

On Unifying Video Generation and Camera Pose Estimation

CHP Huang, JS Yoon, H Jeong, N Mitra… - arXiv preprint arXiv …, 2025 - arxiv.org
Inspired by the emergent 3D capabilities in image generators, we explore whether video
generators similarly exhibit 3D awareness. Using structure-from-motion (SfM) as a …

LoRA3D: Low-Rank Self-Calibration of 3D Geometric Foundation Models

Z Lu, H Yang, D Xu, B Li, B Ivanovic, M Pavone… - arXiv preprint arXiv …, 2024 - arxiv.org
Emerging 3D geometric foundation models, such as DUSt3R, offer a promising approach for
in-the-wild 3D vision tasks. However, due to the high-dimensional nature of the problem …

Camera Pose Estimation Emerging In Video Diffusion Transformer

CHP Huang, JS Yoon, H Jeong, N Mitra, D Ceylan - openreview.net
Diffusion-based video generators are now a reality. Being trained on a large corpus of real
videos, such models can generate diverse yet realistic videos (Brooks et al., 2024; Zheng et …