- Academic Search

D Zhang, Y Yu, J Dong, C Li, D Su, C Chu… - arxiv preprint arxiv …, 2024 - arxiv.org

In the past year, MultiModal Large Language Models (MM-LLMs) have undergone
substantial advancements, augmenting off-the-shelf LLMs to support MM inputs or outputs …

Enregistrer Citer Cité 204 fois Autres articles Les 2 versions Free GPT-4 Version HTML

[Free GPT-4]

[PDF] tandfonline.com

Real-world robot applications of foundation models: A review

K Kawaharazuka, T Matsushima… - Advanced …, 2024 - Taylor & Francis

Recent developments in foundation models, like Large Language Models (LLMs) and Vision-
Language Models (VLMs), trained on extensive data, facilitate flexible application across …

Enregistrer Citer Cité 39 fois Autres articles Les 2 versions Free GPT-4

[Free GPT-4]

[PDF] thecvf.com

Improved baselines with visual instruction tuning

H Liu, C Li, Y Li, YJ Lee - … of the IEEE/CVF Conference on …, 2024 - openaccess.thecvf.com

Large multimodal models (LMM) have recently shown encouraging progress with visual
instruction tuning. In this paper we present the first systematic study to investigate the design …

Enregistrer Citer Cité 1745 fois Autres articles Les 5 versions Free GPT-4 Version HTML

[Free GPT-4]

[PDF] researchhub.com

[PDF][PDF] Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond

J Bai, S Bai, S Yang, S Wang… - arxiv preprint …, 2023 - storage.prod.researchhub.com

In this work, we introduce the Qwen-VL series, a set of large-scale vision-language models
(LVLMs) designed to perceive and understand both texts and images. Starting from the …

Enregistrer Citer Cité 539 fois Autres articles Version HTML

[Free GPT-4]

[PDF] neurips.cc

Language is not all you need: Aligning perception with language models

S Huang, L Dong, W Wang, Y Hao… - Advances in …, 2023 - proceedings.neurips.cc

A big convergence of language, multimodal perception, action, and world modeling is a key
step toward artificial general intelligence. In this work, we introduce KOSMOS-1, a …

Enregistrer Citer Cité 478 fois Autres articles Les 5 versions Free GPT-4 Version HTML

[Free GPT-4]

[PDF] thecvf.com

Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi

X Yue, Y Ni, K Zhang, T Zheng, R Liu… - Proceedings of the …, 2024 - openaccess.thecvf.com

We introduce MMMU: a new benchmark designed to evaluate multimodal models on
massive multi-discipline tasks demanding college-level subject knowledge and deliberate …

Enregistrer Citer Cité 526 fois Autres articles Les 3 versions Free GPT-4 Version HTML

[Free GPT-4]

[PDF] arxiv.org

MM1: methods, analysis and insights from multimodal LLM pre-training

B McKinzie, Z Gan, JP Fauconnier, S Dodge… - … on Computer Vision, 2024 - Springer

In this work, we discuss building performant Multimodal Large Language Models (MLLMs).
In particular, we study the importance of various architecture components and data choices …

Enregistrer Citer Cité 179 fois Autres articles Les 2 versions Free GPT-4

[Free GPT-4]

[PDF] arxiv.org

Open x-embodiment: Robotic learning datasets and rt-x models

A O'Neill, A Rehman, A Gupta, A Maddukuri… - arxiv preprint arxiv …, 2023 - arxiv.org

Large, high-capacity models trained on diverse datasets have shown remarkable successes
on efficiently tackling downstream applications. In domains from NLP to Computer Vision …

Enregistrer Citer Cité 294 fois Autres articles Les 2 versions Free GPT-4 Version HTML

[Free GPT-4]

[PDF] arxiv.org

Cogvlm: Visual expert for pretrained language models

W Wang, Q Lv, W Yu, W Hong, J Qi, Y Wang… - arxiv preprint arxiv …, 2023 - arxiv.org

We introduce CogVLM, a powerful open-source visual language foundation model. Different
from the popular shallow alignment method which maps image features into the input space …

Enregistrer Citer Cité 551 fois Autres articles Les 3 versions Free GPT-4 Version HTML

[Free GPT-4]

[PDF] thecvf.com

Cogagent: A visual language model for gui agents

W Hong, W Wang, Q Lv, J Xu, W Yu… - Proceedings of the …, 2024 - openaccess.thecvf.com

People are spending an enormous amount of time on digital devices through graphical user
interfaces (GUIs) eg computer or smartphone screens. Large language models (LLMs) such …

Enregistrer Citer Cité 236 fois Autres articles Les 3 versions Free GPT-4 Version HTML

Créer l'alerte

Citer

Recherche avancée

Enregistré dans Ma bibliothèque

Pali-x: On scaling up a multilingual vision and language model

Mm-llms: Recent advances in multimodal large language models

Real-world robot applications of foundation models: A review

Improved baselines with visual instruction tuning

[PDF][PDF] Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond

Language is not all you need: Aligning perception with language models

Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi

MM1: methods, analysis and insights from multimodal LLM pre-training

Open x-embodiment: Robotic learning datasets and rt-x models

Cogvlm: Visual expert for pretrained language models

Cogagent: A visual language model for gui agents