A survey on text-guided 3D visual grounding: elements, recent advances, and future directions

D Liu, Y Liu, W Huang, W Hu - arxiv …

… Manga
Y Wu, X Hu, Y Sun, Y Zhou, W Zhu, F Rao… - arxiv preprint arxiv …, 2024 - arxiv.org
Video Large Language Models (Vid-LLMs) have made remarkable advancements in
comprehending video content for QA dialogue. However, they struggle to extend this visual …

GPT4Scene: Understand 3D Scenes from Videos with Vision-Language Models

Z Qi, Z Zhang, Y Fang, J Wang, H Zhao - arxiv preprint arxiv:2501.01428, 2025 - arxiv.org
In recent years, 2D Vision-Language Models (VLMs) have made significant strides in image-
text understanding tasks. However, their performance in 3D spatial comprehension, which is …

Multi-Object 3D Grounding with Dynamic Modules and Language-Informed Spatial Attention

H Zhang, CA Yang, RA Yeh - arxiv preprint arxiv:2410.22306, 2024 - arxiv.org
Multi-object 3D Grounding involves locating 3D boxes based on a given query phrase from
a point cloud. It is a challenging and significant task with numerous applications in visual …

DenseGrounding: Improving Dense Language-Vision Semantics for Ego-centric 3D Visual Grounding

H Zheng, H Shi, Q Peng, YX Chng, R Huang… - … Conference on Learning … - openreview.net
Enabling intelligent agents to comprehend and interact with 3D environments through
natural language is crucial for advancing robotics and human-computer interaction. A …