MMInstruct: A high-quality multi-modal instruction tuning dataset with extensive diversity

Y Liu, Y Cao, Z Gao, W Wang, Z Chen, W Wang… - Science China …, 2024 - Springer
Despite the effectiveness of vision-language supervised fine-tuning in enhancing the
performance of vision large language models (VLLMs), existing visual instruction tuning …

Visual prompting in multimodal large language models: A survey

J Wu, Z Zhang, Y Xia, X Li, Z Xia, A Chang, T Yu… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal large language models (MLLMs) equip pre-trained large-language models
(LLMs) with visual capabilities. While textual prompting in LLMs has been widely studied …

MMFuser: Multimodal multi-layer feature fuser for fine-grained vision-language understanding

Y Cao, Y Liu, Z Chen, G Shi, W Wang, D Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
Despite significant advancements in Multimodal Large Language Models (MLLMs) for
understanding complex human intentions through cross-modal interactions, capturing …

Task preference optimization: Improving multimodal large language models with vision task alignment

Z Yan, Z Li, Y He, C Wang, K Li, X Li, X Zeng… - arXiv preprint arXiv …, 2024 - arxiv.org
Current multimodal large language models (MLLMs) struggle with fine-grained or precise
understanding of visuals though they give comprehensive perception and reasoning in a …

GeoGround: A unified large vision-language model for remote sensing visual grounding

Y Zhou, M Lan, X Li, Y Ke, X Jiang, L Feng… - arXiv preprint arXiv …, 2024 - arxiv.org
Remote sensing (RS) visual grounding aims to use natural language expression to locate
specific objects (in the form of the bounding box or segmentation mask) in RS images …

MM-CamObj: A Comprehensive Multimodal Dataset for Camouflaged Object Scenarios

J Ruan, W Yuan, Z Lin, N Liao, Z Li, F Xiong… - arXiv preprint arXiv …, 2024 - arxiv.org
Large visual-language models (LVLMs) have achieved great success in multiple
applications. However, they still encounter challenges in complex scenes, especially those …

Multimodal 3D Reasoning Segmentation with Complex Scenes

X Jiang, L Lu, L Shao, S Lu - arXiv preprint arXiv:2411.13927, 2024 - arxiv.org
The recent development in multimodal learning has greatly advanced the research in 3D
scene understanding in various real-world tasks such as embodied AI. However, most …

Advancing Fine-Grained Visual Understanding with Multi-Scale Alignment in Multi-Modal Models

W Wang, Z Li, Q Xu, L Li, YQ Cai, B Jiang… - arXiv preprint arXiv …, 2024 - arxiv.org
Multi-modal large language models (MLLMs) have achieved remarkable success in fine-
grained visual understanding across a range of tasks. However, they often encounter …

HERMES: A Unified Self-Driving World Model for Simultaneous 3D Scene Understanding and Generation

X Zhou, D Liang, S Tu, X Chen, Y Ding… - arXiv preprint arXiv …, 2025 - arxiv.org
Driving World Models (DWMs) have become essential for autonomous driving by enabling
future scene prediction. However, existing DWMs are limited to scene generation and fail to …

ChatRex: Taming Multimodal LLM for Joint Perception and Understanding

Q Jiang, Y Yang, Y Xiong, Y Chen, Z Zeng… - arXiv preprint arXiv …, 2024 - arxiv.org
Perception and understanding are two pillars of computer vision. While multimodal large
language models (MLLMs) have demonstrated remarkable visual understanding capabilities …