MM-LLMs: Recent advances in multimodal large language models

D Zhang, Y Yu, J Dong, C Li, D Su, C Chu… - arXiv preprint arXiv…, 2024 - arxiv.org
In the past year, MultiModal Large Language Models (MM-LLMs) have undergone
substantial advancements, augmenting off-the-shelf LLMs to support MM inputs or outputs …

MMBench: Is your multi-modal model an all-around player?

Y Liu, H Duan, Y Zhang, B Li, S Zhang, W Zhao… - European Conference on Computer Vision, 2024 - Springer
Large vision-language models (VLMs) have recently achieved remarkable progress,
exhibiting impressive multimodal perception and reasoning abilities. However, effectively …

MiniCPM-V: A GPT-4V level MLLM on your phone

Y Yao, T Yu, A Zhang, C Wang, J Cui, H Zhu… - arXiv preprint arXiv…, 2024 - arxiv.org
The recent surge of Multimodal Large Language Models (MLLMs) has fundamentally
reshaped the landscape of AI research and industry, shedding light on a promising path …

Blink: Multimodal large language models can see but not perceive

X Fu, Y Hu, B Li, Y Feng, H Wang, X Lin, D Roth… - European Conference on Computer Vision, 2024 - Springer
We introduce Blink, a new benchmark for multimodal language models (LLMs) that focuses
on core visual perception abilities not found in other evaluations. Most of the Blink tasks can …

OMG-LLaVA: Bridging image-level, object-level, pixel-level reasoning and understanding

T Zhang, X Li, H Fei, H Yuan, S Wu… - Advances in Neural Information Processing Systems, 2025 - proceedings.neurips.cc
Current universal segmentation methods demonstrate strong capabilities in pixel-level
image and video understanding. However, they lack reasoning abilities and cannot be …

InternLM-XComposer2: Mastering free-form text-image composition and comprehension in vision-language large model

X Dong, P Zhang, Y Zang, Y Cao, B Wang… - arXiv preprint arXiv…, 2024 - arxiv.org
We introduce InternLM-XComposer2, a cutting-edge vision-language model excelling in free-
form text-image composition and comprehension. This model goes beyond conventional …

MMT-Bench: A comprehensive multimodal benchmark for evaluating large vision-language models towards multitask AGI

K Ying, F Meng, J Wang, Z Li, H Lin, Y Yang… - arXiv preprint arXiv…, 2024 - arxiv.org
Large Vision-Language Models (LVLMs) have made significant strides in general-purpose
multimodal applications such as visual dialogue and embodied navigation. However …

MoAI: Mixture of all intelligence for large language and vision models

BK Lee, B Park, CW Kim, YM Ro - European Conference on Computer Vision, 2024 - Springer
The rise of large language models (LLMs) and instruction tuning has led to the current trend
of instruction-tuned large language and vision models (LLVMs). This trend involves either …

A survey on evaluation of multimodal large language models

J Huang, J Zhang - arXiv preprint arXiv:2408.15769, 2024 - arxiv.org
Multimodal Large Language Models (MLLMs) mimic the human perception and reasoning
system by integrating powerful Large Language Models (LLMs) with various modality …

Video-CCAM: Enhancing video-language understanding with causal cross-attention masks for short and long videos

J Fei, D Li, Z Deng, Z Wang, G Liu, H Wang - arXiv preprint arXiv…, 2024 - arxiv.org
Multi-modal large language models (MLLMs) have demonstrated considerable potential
across various downstream tasks that require cross-domain knowledge. MLLMs capable of …