A Survey of Multimodel Large Language Models

Z Liang, Y Xu, Y Hong, P Shang, Q Wang… - Proceedings of the 3rd …, 2024 - dl.acm.org
With the widespread application of the Transformer architecture in various modalities,
including vision, the technology of large language models is evolving from a single modality …

The revolution of multimodal large language models: a survey

D Caffagni, F Cocchi, L Barsellotti, N Moratelli… - arXiv preprint arXiv …, 2024 - arxiv.org
Connecting text and visual modalities plays an essential role in generative intelligence. For
this reason, inspired by the success of large language models, significant research efforts …

mPLUG-Owl2: Revolutionizing multi-modal large language model with modality collaboration

Q Ye, H Xu, J Ye, M Yan, A Hu, H Liu… - Proceedings of the …, 2024 - openaccess.thecvf.com
Multi-modal Large Language Models (MLLMs) have demonstrated impressive
instruction abilities across various open-ended tasks. However, previous methods have …

CogAgent: A visual language model for GUI agents

W Hong, W Wang, Q Lv, J Xu, W Yu… - Proceedings of the …, 2024 - openaccess.thecvf.com
People are spending an enormous amount of time on digital devices through graphical user
interfaces (GUIs), e.g., computer or smartphone screens. Large language models (LLMs) such …

How far are we to GPT-4V? Closing the gap to commercial multimodal models with open-source suites

Z Chen, W Wang, H Tian, S Ye, Z Gao, E Cui… - Science China …, 2024 - Springer
In this paper, we introduce InternVL 1.5, an open-source multimodal large language model
(MLLM) to bridge the capability gap between open-source and proprietary commercial …

InternLM-XComposer2-4KHD: A pioneering large vision-language model handling resolutions from 336 pixels to 4K HD

X Dong, P Zhang, Y Zang, Y Cao… - Advances in …, 2025 - proceedings.neurips.cc
The Large Vision-Language Model (LVLM) field has seen significant
advancements, yet its progression has been hindered by challenges in comprehending fine …

DeepSeek-VL: towards real-world vision-language understanding

H Lu, W Liu, B Zhang, B Wang, K Dong, B Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
We present DeepSeek-VL, an open-source Vision-Language (VL) Model designed for real-world
vision and language understanding applications. Our approach is structured around …

LLaVA-UHD: an LMM perceiving any aspect ratio and high-resolution images

Z Guo, R Xu, Y Yao, J Cui, Z Ni, C Ge, TS Chua… - … on Computer Vision, 2024 - Springer
Visual encoding constitutes the basis of large multimodal models (LMMs) in understanding
the visual world. Conventional LMMs process images in fixed sizes and limited resolutions …

Hallucination augmented contrastive learning for multimodal large language model

C Jiang, H Xu, M Dong, J Chen, W Ye… - Proceedings of the …, 2024 - openaccess.thecvf.com
Multi-modal large language models (MLLMs) have been shown to efficiently integrate
natural language with visual information to handle multi-modal tasks. However, MLLMs still …

InternLM-XComposer-2.5: A versatile large vision-language model supporting long-contextual input and output

P Zhang, X Dong, Y Zang, Y Cao, R Qian… - arXiv preprint arXiv …, 2024 - arxiv.org
We present InternLM-XComposer-2.5 (IXC-2.5), a versatile large vision-language model that
supports long-contextual input and output. IXC-2.5 excels in various text-image …