Google Academic

Z Liang, Y Xu, Y Hong, P Shang, Q Wang… - Proceedings of the 3rd …, 2024 - dl.acm.org

With the widespread application of the Transformer architecture in various modalities,
including vision, the technology of large language models is evolving from a single modality …

Salvați Citați Citat de 1253 ori Articole cu conținut similar Toate cele 12 versiuni

[Free GPT-4]
[DeepSeek]

[PDF] arxiv.org

Mm-llms: Recent advances in multimodal large language models

D Zhang, Y Yu, J Dong, C Li, D Su, C Chu… - arxiv preprint arxiv …, 2024 - arxiv.org

In the past year, MultiModal Large Language Models (MM-LLMs) have undergone
substantial advancements, augmenting off-the-shelf LLMs to support MM inputs or outputs …

Salvați Citați Citat de 230 ori Articole cu conținut similar Toate cele 6 versiuni Afișare ca HTML

[Free GPT-4]
[DeepSeek]

[PDF] thecvf.com

Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi

X Yue, Y Ni, K Zhang, T Zheng, R Liu… - Proceedings of the …, 2024 - openaccess.thecvf.com

We introduce MMMU: a new benchmark designed to evaluate multimodal models on
massive multi-discipline tasks demanding college-level subject knowledge and deliberate …

Salvați Citați Citat de 603 ori Articole cu conținut similar Toate cele 7 versiuni Afișare ca HTML

[Free GPT-4]
[DeepSeek]

[PDF] arxiv.org

Mmbench: Is your multi-modal model an all-around player?

Y Liu, H Duan, Y Zhang, B Li, S Zhang, W Zhao… - European conference on …, 2024 - Springer

Large vision-language models (VLMs) have recently achieved remarkable progress,
exhibiting impressive multimodal perception and reasoning abilities. However, effectively …

Salvați Citați Citat de 802 ori Articole cu conținut similar Toate cele 9 versiuni

[Free GPT-4]
[DeepSeek]

[PDF] openreview.net

Scaling rectified flow transformers for high-resolution image synthesis

P Esser, S Kulal, A Blattmann, R Entezari… - … on machine learning, 2024 - openreview.net

Diffusion models create data from noise by inverting the forward paths of data towards noise
and have emerged as a powerful generative modeling technique for high-dimensional …

Salvați Citați Citat de 698 ori Articole cu conținut similar Toate cele 6 versiuni Afișare ca HTML

[Free GPT-4]
[DeepSeek]

[PDF] thecvf.com

Vila: On pre-training for visual language models

J Lin, H Yin, W **, P Molchanov… - Proceedings of the …, 2024 - openaccess.thecvf.com

Visual language models (VLMs) rapidly progressed with the recent success of large
language models. There have been growing efforts on visual instruction tuning to extend the …

Salvați Citați Citat de 300 ori Articole cu conținut similar Toate cele 9 versiuni Afișare ca HTML

[Free GPT-4]
[DeepSeek]

[PDF] neurips.cc

What matters when building vision-language models?

H Laurençon, L Tronchon, M Cord… - Advances in Neural …, 2025 - proceedings.neurips.cc

The growing interest in vision-language models (VLMs) has been driven by improvements in
large language models and vision transformers. Despite the abundance of literature on this …

Salvați Citați Citat de 164 ori Articole cu conținut similar Toate cele 4 versiuni Afișare ca HTML

[Free GPT-4]
[DeepSeek]

[PDF] thecvf.com

Cogagent: A visual language model for gui agents

W Hong, W Wang, Q Lv, J Xu, W Yu… - Proceedings of the …, 2024 - openaccess.thecvf.com

People are spending an enormous amount of time on digital devices through graphical user
interfaces (GUIs) eg computer or smartphone screens. Large language models (LLMs) such …

Salvați Citați Citat de 266 ori Articole cu conținut similar Toate cele 6 versiuni Afișare ca HTML

[Free GPT-4]
[DeepSeek]

[PDF] arxiv.org

How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites

Z Chen, W Wang, H Tian, S Ye, Z Gao, E Cui… - Science China …, 2024 - Springer

In this paper, we introduce InternVL 1.5, an open-source multimodal large language model
(MLLM) to bridge the capability gap between open-source and proprietary commercial …

Salvați Citați Citat de 394 ori Articole cu conținut similar Toate cele 4 versiuni

[Free GPT-4]
[DeepSeek]

[PDF] thecvf.com

Generative multimodal models are in-context learners

Q Sun, Y Cui, X Zhang, F Zhang, Q Yu… - Proceedings of the …, 2024 - openaccess.thecvf.com

Humans can easily solve multimodal tasks in context with only a few demonstrations or
simple instructions which current multimodal systems largely struggle to imitate. In this work …

Salvați Citați Citat de 209 ori Articole cu conținut similar Toate cele 6 versiuni Afișare ca HTML

Creează alerta

Citați

Căutare avansată

Salvat în Bibliotecă

Cogvlm: Visual expert for pretrained language models

A Survey of Multimodel Large Language Models

Mm-llms: Recent advances in multimodal large language models

Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi

Mmbench: Is your multi-modal model an all-around player?

Scaling rectified flow transformers for high-resolution image synthesis

Vila: On pre-training for visual language models

What matters when building vision-language models?

Cogagent: A visual language model for gui agents

How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites

Generative multimodal models are in-context learners