A Survey of Multimodel Large Language Models

Z Liang, Y Xu, Y Hong, P Shang, Q Wang… - Proceedings of the 3rd …, 2024 - dl.acm.org
With the widespread application of the Transformer architecture in various modalities,
including vision, the technology of large language models is evolving from a single modality …

Unifying 3d vision-language understanding via promptable queries

Z Zhu, Z Zhang, X Ma, X Niu, Y Chen, B Jia… - … on Computer Vision, 2024 - Springer
A unified model for 3D vision-language (3D-VL) understanding is expected to take various
scene representations and perform a wide range of tasks in a 3D scene. However, a …

View selection for 3d captioning via diffusion ranking

T Luo, J Johnson, H Lee - European Conference on Computer Vision, 2024 - Springer
Scalable annotation approaches are crucial for constructing extensive 3D-text datasets,
facilitating a broader range of applications. However, existing methods sometimes lead to …

An embodied generalist agent in 3d world

J Huang, S Yong, X Ma, X Linghu, P Li, Y Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
Leveraging massive knowledge and learning schemes from large language models (LLMs),
recent machine learning models show notable successes in building generalist agents that …

Tod3cap: Towards 3d dense captioning in outdoor scenes

B Jin, Y Zheng, P Li, W Li, Y Zheng, S Hu, X Liu… - … on Computer Vision, 2024 - Springer
3D dense captioning stands as a cornerstone in achieving a comprehensive
understanding of 3D scenes through natural language. It has recently witnessed remarkable …

Motionchain: Conversational motion controllers via multimodal prompts

B Jiang, X Chen, C Zhang, F Yin, Z Li, G Yu… - European Conference on …, 2024 - Springer
Recent advancements in language models have demonstrated their adeptness in
conducting multi-turn dialogues and retaining conversational context. However, this …

Minigpt-3d: Efficiently aligning 3d point clouds with large language models using 2d priors

Y Tang, X Han, X Li, Q Yu, Y Hao, L Hu… - Proceedings of the 32nd …, 2024 - dl.acm.org
Large 2D vision-language models (2D-LLMs) have gained significant attention by bridging
Large Language Models (LLMs) with images using a simple projector. Inspired by their …

A survey of label-efficient deep learning for 3D point clouds

A Xiao, X Zhang, L Shao, S Lu - IEEE Transactions on Pattern …, 2024 - ieeexplore.ieee.org
In the past decade, deep neural networks have achieved significant progress in point cloud
learning. However, collecting large-scale precisely-annotated point clouds is extremely …

Chat-scene: Bridging 3d scene and large language models with object identifiers

H Huang, Y Chen, Z Wang, R Huang, R Xu… - The Thirty-eighth …, 2024 - openreview.net
Recent advancements in 3D Large Language Models (LLMs) have demonstrated promising
capabilities for 3D scene understanding. However, previous methods exhibit deficiencies in …

Scanreason: Empowering 3d visual grounding with reasoning capabilities

C Zhu, T Wang, W Zhang, K Chen, X Liu - European Conference on …, 2024 - Springer
Although great progress has been made in 3D visual grounding, current models still rely on
explicit textual descriptions for grounding and lack the ability to reason human intentions …