Mm-llms: Recent advances in multimodal large language models

D Zhang, Y Yu, J Dong, C Li, D Su, C Chu… - arxiv preprint arxiv …, 2024 - arxiv.org
In the past year, MultiModal Large Language Models (MM-LLMs) have undergone
substantial advancements, augmenting off-the-shelf LLMs to support MM inputs or outputs …

Grounding and evaluation for large language models: Practical challenges and lessons learned (survey)

K Kenthapadi, M Sameki, A Taly - Proceedings of the 30th ACM SIGKDD …, 2024 - dl.acm.org
With the ongoing rapid adoption of Artificial Intelligence (AI)-based systems in high-stakes
domains, ensuring the trustworthiness, safety, and observability of these systems has …

Ulip-2: Towards scalable multimodal pre-training for 3d understanding

L Xue, N Yu, S Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Recent advancements in multimodal pre-training have shown promising efficacy in 3D
representation learning by aligning multimodal features across 3D shapes their 2D …

Uni-moe: Scaling unified multimodal llms with mixture of experts

Y Li, S Jiang, B Hu, L Wang, W Zhong… - … on Pattern Analysis …, 2025 - ieeexplore.ieee.org
Recent advancements in Multimodal Large Language Models (MLLMs) underscore the
significance of scalable models and data to boost performance, yet this often incurs …

A survey of multimodal large language model from a data-centric perspective

T Bai, H Liang, B Wan, Y Xu, X Li, S Li, L Yang… - arxiv preprint arxiv …, 2024 - arxiv.org
Multimodal large language models (MLLMs) enhance the capabilities of standard large
language models by integrating and processing data from multiple modalities, including text …

A comprehensive review of multimodal large language models: Performance and challenges across different tasks

J Wang, H Jiang, Y Liu, C Ma, X Zhang, Y Pan… - arxiv preprint arxiv …, 2024 - arxiv.org
In an era defined by the explosive growth of data and rapid technological advancements,
Multimodal Large Language Models (MLLMs) stand at the forefront of artificial intelligence …

View selection for 3d captioning via diffusion ranking

T Luo, J Johnson, H Lee - European Conference on Computer Vision, 2024 - Springer
Scalable annotation approaches are crucial for constructing extensive 3D-text datasets,
facilitating a broader range of applications. However, existing methods sometimes lead to …

Minigpt-3d: Efficiently aligning 3d point clouds with large language models using 2d priors

Y Tang, X Han, X Li, Q Yu, Y Hao, L Hu… - Proceedings of the 32nd …, 2024 - dl.acm.org
Large 2D vision-language models (2D-LLMs) have gained significant attention by bridging
Large Language Models (LLMs) with images using a simple projector. Inspired by their …

Meerkat: Audio-visual large language model for grounding in space and time

S Chowdhury, S Nag, S Dasgupta, J Chen… - … on Computer Vision, 2024 - Springer
Abstract Leveraging Large Language Models' remarkable proficiency in text-based tasks,
recent works on Multi-modal LLMs (MLLMs) extend them to other modalities like vision and …

[PDF][PDF] Crema: Multimodal compositional video reasoning via efficient modular adaptation and fusion

S Yu, J Yoon, M Bansal - arxiv preprint arxiv:2402.05889, 2024 - southnlp.github.io
Despite impressive advancements in multimodal compositional reasoning approaches, they
are still limited in their flexibility and efficiency by processing fixed modality inputs while …