MMInstruct: A high-quality multi-modal instruction tuning dataset with extensive diversity

Y Liu, Y Cao, Z Gao, W Wang, Z Chen, W Wang… - Science China …, 2024 - Springer
Despite the effectiveness of vision-language supervised fine-tuning in enhancing the
performance of vision large language models (VLLMs), existing visual instruction tuning …

Visual prompting in multimodal large language models: A survey

J Wu, Z Zhang, Y Xia, X Li, Z Xia, A Chang, T Yu… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal large language models (MLLMs) equip pre-trained large-language models
(LLMs) with visual capabilities. While textual prompting in LLMs has been widely studied …

MMFuser: Multimodal multi-layer feature fuser for fine-grained vision-language understanding

Y Cao, Y Liu, Z Chen, G Shi, W Wang, D Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
Despite significant advancements in Multimodal Large Language Models (MLLMs) for
understanding complex human intentions through cross-modal interactions, capturing …

Task preference optimization: Improving multimodal large language models with vision task alignment

Z Yan, Z Li, Y He, C Wang, K Li, X Li, X Zeng… - arXiv preprint arXiv …, 2024 - arxiv.org
Current multimodal large language models (MLLMs) struggle with fine-grained or precise
understanding of visuals though they give comprehensive perception and reasoning in a …

GeoGround: A unified large vision-language model for remote sensing visual grounding

Y Zhou, M Lan, X Li, Y Ke, X Jiang, L Feng… - arXiv preprint arXiv …, 2024 - arxiv.org
Remote sensing (RS) visual grounding aims to use natural language expression to locate
specific objects (in the form of the bounding box or segmentation mask) in RS images …

MM-CamObj: A Comprehensive Multimodal Dataset for Camouflaged Object Scenarios

J Ruan, W Yuan, Z Lin, N Liao, Z Li, F Xiong… - arXiv preprint arXiv …, 2024 - arxiv.org
Large visual-language models (LVLMs) have achieved great success in multiple
applications. However, they still encounter challenges in complex scenes, especially those …

Multimodal 3D Reasoning Segmentation with Complex Scenes

X Jiang, L Lu, L Shao, S Lu - arXiv preprint arXiv:2411.13927, 2024 - arxiv.org
The recent development in multimodal learning has greatly advanced the research in 3D
scene understanding in various real-world tasks such as embodied AI. However, most …

Advancing Fine-Grained Visual Understanding with Multi-Scale Alignment in Multi-Modal Models

W Wang, Z Li, Q Xu, L Li, YQ Cai, B Jiang… - arXiv preprint arXiv …, 2024 - arxiv.org
Multi-modal large language models (MLLMs) have achieved remarkable success in fine-
grained visual understanding across a range of tasks. However, they often encounter …

HERMES: A Unified Self-Driving World Model for Simultaneous 3D Scene Understanding and Generation

X Zhou, D Liang, S Tu, X Chen, Y Ding… - arXiv preprint arXiv …, 2025 - arxiv.org
Driving World Models (DWMs) have become essential for autonomous driving by enabling
future scene prediction. However, existing DWMs are limited to scene generation and fail to …

ChatRex: Taming Multimodal LLM for Joint Perception and Understanding

Q Jiang, Y Yang, Y Xiong, Y Chen, Z Zeng… - arXiv preprint arXiv …, 2024 - arxiv.org
Perception and understanding are two pillars of computer vision. While multimodal large
language models (MLLMs) have demonstrated remarkable visual understanding capabilities …