Google Scholar

InternLM-XComposer2: Mastering free-form text-image composition and comprehension in vision-langu...

MM-LLMs: Recent advances in multimodal large language models

D Zhang, Y Yu, J Dong, C Li, D Su, C Chu… - arXiv preprint arXiv …, 2024 - arxiv.org

In the past year, MultiModal Large Language Models (MM-LLMs) have undergone
substantial advancements, augmenting off-the-shelf LLMs to support MM inputs or outputs …

Cited by 231

From large language models to large multimodal models: A literature review

D Huang, C Yan, Q Li, X Peng - Applied Sciences, 2024 - mdpi.com

With the deepening of research on Large Language Models (LLMs), significant progress has
been made in recent years on the development of Large Multimodal Models (LMMs), which …

Cited by 20

MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI

X Yue, Y Ni, K Zhang, T Zheng, R Liu… - Proceedings of the …, 2024 - openaccess.thecvf.com

We introduce MMMU: a new benchmark designed to evaluate multimodal models on
massive multi-discipline tasks demanding college-level subject knowledge and deliberate …

Cited by 610

MMBench: Is your multi-modal model an all-around player?

Y Liu, H Duan, Y Zhang, B Li, S Zhang, W Zhao… - European conference on …, 2024 - Springer

Large vision-language models (VLMs) have recently achieved remarkable progress,
exhibiting impressive multimodal perception and reasoning abilities. However, effectively …

Cited by 809

How far are we to GPT-4V? Closing the gap to commercial multimodal models with open-source suites

Z Chen, W Wang, H Tian, S Ye, Z Gao, E Cui… - Science China …, 2024 - Springer

In this paper, we introduce InternVL 1.5, an open-source multimodal large language model
(MLLM) to bridge the capability gap between open-source and proprietary commercial …

Cited by 403

InternLM-XComposer2-4KHD: A pioneering large vision-language model handling resolutions from 336 pixels to 4K HD

X Dong, P Zhang, Y Zang, Y Cao… - Advances in …, 2025 - proceedings.neurips.cc

The Large Vision-Language Model (LVLM) field has seen significant
advancements, yet its progression has been hindered by challenges in comprehending fine …

Cited by 115

ShareGPT4Video: Improving video understanding and generation with better captions

L Chen, X Wei, J Li, X Dong, P Zhang… - Advances in …, 2025 - proceedings.neurips.cc

We present the ShareGPT4Video series, aiming to facilitate the video
understanding of large video-language models (LVLMs) and the video generation of text-to …

Cited by 102

DeepSeek-VL: Towards real-world vision-language understanding

H Lu, W Liu, B Zhang, B Wang, K Dong, B Liu… - arXiv preprint arXiv …, 2024 - arxiv.org

We present DeepSeek-VL, an open-source Vision-Language (VL) Model designed for real-
world vision and language understanding applications. Our approach is structured around …

Cited by 231

Mini-Gemini: Mining the potential of multi-modality vision language models

Y Li, Y Zhang, C Wang, Z Zhong, Y Chen… - arXiv preprint arXiv …, 2024 - arxiv.org

In this work, we introduce Mini-Gemini, a simple and effective framework enhancing multi-
modality Vision Language Models (VLMs). Despite the advancements in VLMs facilitating …

Cited by 213

OPERA: Alleviating hallucination in multi-modal large language models via over-trust penalty and retrospection-allocation

Q Huang, X Dong, P Zhang, B Wang… - Proceedings of the …, 2024 - openaccess.thecvf.com

Hallucination posed as a pervasive challenge of multi-modal large language models
(MLLMs) has significantly impeded their real-world usage that demands precise judgment …

Cited by 149
