RVT: Robotic view transformer for 3D object manipulation

A Goyal, J Xu, Y Guo, V Blukis… - Conference on Robot …, 2023 - proceedings.mlr.press
For 3D object manipulation, methods that build an explicit 3D representation perform better
than those relying only on camera images. But using explicit 3D representations like voxels …

UniT3D: A unified transformer for 3D dense captioning and visual grounding

Z Chen, R Hu, X Chen, M Nießner… - Proceedings of the …, 2023 - openaccess.thecvf.com
Performing 3D dense captioning and visual grounding requires a common and shared
understanding of the underlying multimodal relationships. However, despite some previous …

Invariant training 2D-3D joint hard samples for few-shot point cloud recognition

X Yi, J Deng, Q Sun, XS Hua… - Proceedings of the …, 2023 - openaccess.thecvf.com
We tackle the data scarcity challenge in few-shot point cloud recognition of 3D objects by
using a joint prediction from a conventional 3D model and a well-pretrained 2D model …

Self-supervised learning for pre-training 3D point clouds: A survey

B Fei, W Yang, L Liu, T Luo, R Zhang, Y Li… - arXiv preprint arXiv …, 2023 - arxiv.org
Point cloud data has been extensively studied due to its compact form and flexibility in
representing complex 3D structures. The ability of point cloud data to accurately capture and …

Point cloud self-supervised learning via 3D to multi-view masked autoencoder

Z Chen, Y Li, L Jing, L Yang, B Li - arXiv preprint arXiv:2311.10887, 2023 - arxiv.org
In recent years, the field of 3D self-supervised learning has witnessed significant progress,
resulting in the emergence of Multi-Modality Masked AutoEncoders (MAE) methods that …

TrackNeRF: Bundle adjusting NeRF from sparse and noisy views via feature tracks

J Mai, W Zhu, S Rojas, J Zarzar, A Hamdi… - … on Computer Vision, 2024 - Springer
Neural radiance fields (NeRFs) generally require many images with accurate poses for
accurate novel view synthesis, which does not reflect realistic setups where views can be …

Pix4Point: Image pretrained transformers for 3D point cloud understanding

G Qian, X Zhang, A Hamdi, B Ghanem - 2022 - repository.kaust.edu.sa
Pure Transformer models have achieved impressive success in natural language
processing and computer vision. However, one limitation with Transformers is their need for …

EgoLoc: Revisiting 3D object localization from egocentric videos with visual queries

J Mai, A Hamdi, S Giancola, C Zhao… - Proceedings of the …, 2023 - openaccess.thecvf.com
With the recent advances in video and 3D understanding, novel 4D spatio-temporal
methods fusing both concepts have emerged. Towards this direction, the Ego4D Episodic …

Sculpting Holistic 3D Representation in Contrastive Language-Image-3D Pre-training

Y Gao, Z Wang, WS Zheng, C Xie… - Proceedings of the …, 2024 - openaccess.thecvf.com
Contrastive learning has emerged as a promising paradigm for 3D open-world
understanding, i.e., aligning point cloud representation to image and text embedding space …

MVTN: Learning multi-view transformations for 3D understanding

A Hamdi, F AlZahrani, S Giancola… - International Journal of …, 2024 - Springer
Multi-view projection techniques have shown themselves to be highly effective in achieving
top-performing results in the recognition of 3D shapes. These methods involve learning how …