3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment

Z Zhu, X Ma, Y Chen, Z Deng… - Proceedings of the …, 2023 - openaccess.thecvf.com
3D vision-language grounding (3D-VL) is an emerging field that aims to connect the
3D physical world with natural language, which is crucial for achieving embodied …

ShapeLLM: Universal 3D Object Understanding for Embodied Interaction

Z Qi, R Dong, S Zhang, H Geng, C Han, Z Ge… - … on Computer Vision, 2024 - Springer
This paper presents ShapeLLM, the first 3D Multimodal Large Language Model (LLM)
designed for embodied interaction, exploring a universal 3D object understanding with 3D …

SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding

B Jia, Y Chen, H Yu, Y Wang, X Niu, T Liu, Q Li… - … on Computer Vision, 2024 - Springer
3D vision-language (3D-VL) grounding, which aims to align language with 3D
physical environments, stands as a cornerstone in developing embodied agents. In …

OpenEQA: Embodied Question Answering in the Era of Foundation Models

A Majumdar, A Ajay, X Zhang, P Putta… - Proceedings of the …, 2024 - openaccess.thecvf.com
We present a modern formulation of Embodied Question Answering (EQA) as the task of
understanding an environment well enough to answer questions about it in natural …

NuScenes-QA: A Multi-modal Visual Question Answering Benchmark for Autonomous Driving Scenario

T Qian, J Chen, L Zhuo, Y Jiao, YG Jiang - Proceedings of the AAAI …, 2024 - ojs.aaai.org
We introduce a novel visual question answering (VQA) task in the context of autonomous
driving, aiming to answer natural language questions based on street-view clues. Compared …

LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning

S Chen, X Chen, C Zhang, M Li, G Yu… - Proceedings of the …, 2024 - openaccess.thecvf.com
Recent progress in Large Multimodal Models (LMM) has opened up great
possibilities for various applications in the field of human-machine interactions. However …