Segment3D: Learning Fine-Grained Class-Agnostic 3D Segmentation without Manual Labels
Current 3D scene segmentation methods are heavily dependent on manually annotated 3D
training datasets. Such manual annotations are labor-intensive, and often lack fine-grained …
P2P-Bridge: Diffusion Bridges for 3D Point Cloud Denoising
In this work, we address the task of point cloud denoising using a novel framework adapting
Diffusion Schrödinger bridges to unstructured data like point sets. Unlike previous works that …
Track4Gen: Teaching Video Diffusion Models to Track Points Improves Video Generation
While recent foundational video generators produce visually rich output, they still struggle
with appearance drift, where objects gradually degrade or change inconsistently across …
DeSiRe-GS: 4D Street Gaussians for Static-Dynamic Decomposition and Surface Reconstruction for Urban Driving Scenes
We present DeSiRe-GS, a self-supervised Gaussian splatting representation, enabling
effective static-dynamic decomposition and high-fidelity surface reconstruction in complex …
Feat2GS: Probing Visual Foundation Models with Gaussian Splatting
Given that visual foundation models (VFMs) are trained on extensive datasets but often
limited to 2D images, a natural question arises: how well do they understand the 3D world …
DepthCues: Evaluating Monocular Depth Perception in Large Vision Models
Large-scale pre-trained vision models are becoming increasingly prevalent, offering
expressive and generalizable visual representations that benefit various downstream tasks …
Multiview Equivariance Improves 3D Correspondence Understanding with Minimal Feature Finetuning
Vision foundation models, particularly the ViT family, have revolutionized image
understanding by providing rich semantic features. However, despite their success in 2D …
On Unifying Video Generation and Camera Pose Estimation
Inspired by the emergent 3D capabilities in image generators, we explore whether video
generators similarly exhibit 3D awareness. Using structure-from-motion (SfM) as a …
LoRA3D: Low-Rank Self-Calibration of 3D Geometric Foundation Models
Emerging 3D geometric foundation models, such as DUSt3R, offer a promising approach for
in-the-wild 3D vision tasks. However, due to the high-dimensional nature of the problem …
Camera Pose Estimation Emerging in Video Diffusion Transformer
Diffusion-based video generators are now a reality. Being trained on a large corpus of real
videos, such models can generate diverse yet realistic videos (Brooks et al., 2024; Zheng et …