Video-ChatGPT: Towards detailed video understanding via large vision and language models
Conversation agents fueled by Large Language Models (LLMs) are providing a new way to
interact with visual data. While there have been initial attempts for image-based …
Decomposing NeRF for editing via feature field distillation
Emerging neural radiance fields (NeRF) are a promising scene representation for computer
graphics, enabling high-quality 3D reconstruction and novel view synthesis from image …
OpenScene: 3D scene understanding with open vocabularies
Traditional 3D scene understanding approaches rely on labeled 3D datasets to train a
model for a single task with supervision. We propose OpenScene, an alternative approach …
ScanNet++: A high-fidelity dataset of 3D indoor scenes
We present ScanNet++, a large-scale dataset that couples together capture of high-quality
and commodity-level geometry and color of indoor scenes. Each scene is captured with a …
CLIP2Scene: Towards label-efficient 3D scene understanding by CLIP
Contrastive Language-Image Pre-training (CLIP) achieves promising results in 2D
zero-shot and few-shot learning. Despite the impressive performance in 2D, applying CLIP …
Point Transformer V3: Simpler Faster Stronger
This paper is not motivated to seek innovation within the attention mechanism. Instead, it
focuses on overcoming the existing trade-offs between accuracy and efficiency within the …
OpenShape: Scaling up 3D shape representation towards open-world understanding
We introduce OpenShape, a method for learning multi-modal joint representations of text,
image, and point clouds. We adopt the commonly used multi-modal contrastive learning …
Towards large-scale 3D representation learning with multi-dataset point prompt training
The rapid advancement of deep learning models is often attributed to their ability to leverage
massive training data. In contrast, such privilege has not yet fully benefited 3D deep learning …
Mask3D: Mask transformer for 3D semantic instance segmentation
Modern 3D semantic instance segmentation approaches predominantly rely on specialized
voting mechanisms followed by carefully designed geometric clustering techniques. Building …
Multi3DRefer: Grounding text description to multiple 3D objects
We introduce the task of localizing a flexible number of objects in real-world 3D scenes
using natural language descriptions. Existing 3D visual grounding tasks focus on localizing …