3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment
3D vision-language grounding (3D-VL) is an emerging field that aims to connect the 3D physical world with natural language, which is crucial for achieving embodied …
ShapeLLM: Universal 3D Object Understanding for Embodied Interaction
This paper presents ShapeLLM, the first 3D Multimodal Large Language Model (LLM) designed for embodied interaction, exploring a universal 3D object understanding with 3D …
SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding
3D vision-language (3D-VL) grounding, which aims to align language with 3D physical environments, stands as a cornerstone in developing embodied agents. In …
OpenEQA: Embodied Question Answering in the Era of Foundation Models
We present a modern formulation of Embodied Question Answering (EQA) as the task of understanding an environment well enough to answer questions about it in natural …
NuScenes-QA: A Multi-Modal Visual Question Answering Benchmark for Autonomous Driving Scenario
We introduce a novel visual question answering (VQA) task in the context of autonomous driving, aiming to answer natural language questions based on street-view clues. Compared …
LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning
Recent progress in Large Multimodal Models (LMM) has opened up great possibilities for various applications in the field of human-machine interactions. However …