RVT: Robotic view transformer for 3D object manipulation
For 3D object manipulation, methods that build an explicit 3D representation perform better
than those relying only on camera images. But using explicit 3D representations like voxels …
UniT3D: A unified transformer for 3D dense captioning and visual grounding
Performing 3D dense captioning and visual grounding requires a common and shared
understanding of the underlying multimodal relationships. However, despite some previous …
Invariant training 2D-3D joint hard samples for few-shot point cloud recognition
We tackle the data scarcity challenge in few-shot point cloud recognition of 3D objects by
using a joint prediction from a conventional 3D model and a well-pretrained 2D model …
Self-supervised learning for pre-training 3D point clouds: A survey
Point cloud data has been extensively studied due to its compact form and flexibility in
representing complex 3D structures. The ability of point cloud data to accurately capture and …
Point cloud self-supervised learning via 3D to multi-view masked autoencoder
In recent years, the field of 3D self-supervised learning has witnessed significant progress,
resulting in the emergence of Multi-Modality Masked AutoEncoders (MAE) methods that …
TrackNeRF: Bundle adjusting NeRF from sparse and noisy views via feature tracks
Neural radiance fields (NeRFs) generally require many images with accurate poses for
accurate novel view synthesis, which does not reflect realistic setups where views can be …
Pix4Point: Image pretrained transformers for 3D point cloud understanding
Pure Transformer models have achieved impressive success in natural language
processing and computer vision. However, one limitation with Transformers is their need for …
EgoLoc: Revisiting 3D object localization from egocentric videos with visual queries
With the recent advances in video and 3D understanding, novel 4D spatio-temporal
methods fusing both concepts have emerged. Towards this direction, the Ego4D Episodic …
Sculpting Holistic 3D Representation in Contrastive Language-Image-3D Pre-training
Contrastive learning has emerged as a promising paradigm for 3D open-world
understanding, i.e., aligning point cloud representation to image and text embedding space …
MVTN: Learning multi-view transformations for 3D understanding
Multi-view projection techniques have shown themselves to be highly effective in achieving
top-performing results in the recognition of 3D shapes. These methods involve learning how …