ConceptGraphs: Open-vocabulary 3D scene graphs for perception and planning

Q Gu, A Kuwajerwala, S Morin… - … on Robotics and …, 2024 - ieeexplore.ieee.org
For robots to perform a wide variety of tasks, they require a 3D representation of the world
that is semantically rich, yet compact and efficient for task-driven perception and planning …

ShapeLLM: Universal 3D object understanding for embodied interaction

Z Qi, R Dong, S Zhang, H Geng, C Han, Z Ge… - … on Computer Vision, 2024 - Springer
This paper presents ShapeLLM, the first 3D Multimodal Large Language Model (LLM)
designed for embodied interaction, exploring a universal 3D object understanding with 3D …

Open3DIS: Open-vocabulary 3D instance segmentation with 2D mask guidance

P Nguyen, TD Ngo, E Kalogerakis… - Proceedings of the …, 2024 - openaccess.thecvf.com
We introduce Open3DIS, a novel solution designed to tackle the problem of Open-
Vocabulary Instance Segmentation within 3D scenes. Objects within 3D environments …

RegionPLC: Regional point-language contrastive learning for open-world 3D scene understanding

J Yang, R Ding, W Deng, Z Wang… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
We propose a lightweight and scalable Regional Point-Language Contrastive learning
framework, namely RegionPLC, for open-world 3D scene understanding, aiming to identify …

OpenIns3D: Snap and lookup for 3D open-vocabulary instance segmentation

Z Huang, X Wu, X Chen, H Zhao, L Zhu… - European Conference on …, 2024 - Springer
In this work, we introduce OpenIns3D, a new 3D-input-only framework for 3D open-
vocabulary scene understanding. The OpenIns3D framework employs a “Mask-Snap …

Grounded 3D-LLM with referent tokens

Y Chen, S Yang, H Huang, T Wang, R Xu, R Lyu… - arXiv preprint arXiv …, 2024 - arxiv.org
Prior studies on 3D scene understanding have primarily developed specialized models for
specific tasks or required task-specific fine-tuning. In this study, we propose Grounded 3D …

V-IRL: Grounding Virtual Intelligence in Real Life

J Yang, R Ding, E Brown, X Qi, S Xie - European Conference on Computer …, 2024 - Springer
There is a sensory gulf between the Earth that humans inhabit and the digital realms in
which modern AI agents are created. To develop AI agents that can sense, think, and act as …

Open-vocabulary 3D semantic segmentation with text-to-image diffusion models

X Zhu, H Zhou, P Xing, L Zhao, H Xu, J Liang… - … on Computer Vision, 2024 - Springer
In this paper, we investigate the use of diffusion models which are pre-trained on large-scale
image-caption pairs for open-vocabulary 3D semantic understanding. We propose a novel …

Can 3D Vision-Language Models Truly Understand Natural Language?

W Deng, J Yang, R Ding, J Liu, Y Li, X Qi… - arXiv preprint arXiv …, 2024 - arxiv.org
Rapid advancements in 3D vision-language (3D-VL) tasks have opened up new avenues
for human interaction with embodied agents or robots using natural language. Despite this …

UniM-OV3D: Uni-modality open-vocabulary 3D scene understanding with fine-grained feature representation

Q He, J Peng, Z Jiang, K Wu, X Ji, J Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
3D open-vocabulary scene understanding aims to recognize arbitrary novel categories
beyond the base label space. However, existing works not only fail to fully utilize all the …