MM-LLMs: Recent advances in multimodal large language models
In the past year, MultiModal Large Language Models (MM-LLMs) have undergone
substantial advancements, augmenting off-the-shelf LLMs to support MM inputs or outputs …
Foundation Models Defining a New Era in Vision: a Survey and Outlook
Vision systems that see and reason about the compositional nature of visual scenes are
fundamental to understanding our world. The complex relations between objects and their …
Visual instruction tuning
Instruction tuning large language models (LLMs) using machine-generated instruction-
following data has been shown to improve zero-shot capabilities on new tasks, but the idea …
A survey of large language models
Language is essentially a complex, intricate system of human expressions governed by
grammatical rules. It poses a significant challenge to develop capable AI algorithms for …
MiniGPT-4: Enhancing vision-language understanding with advanced large language models
The recent GPT-4 has demonstrated extraordinary multi-modal abilities, such as directly
generating websites from handwritten text and identifying humorous elements within …
Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond
In this work, we introduce the Qwen-VL series, a set of large-scale vision-language models
(LVLMs) designed to perceive and understand both texts and images. Starting from the …
Image as a foreign language: BEiT pretraining for vision and vision-language tasks
A big convergence of language, vision, and multimodal pretraining is emerging. In this work,
we introduce a general-purpose multimodal foundation model BEiT-3, which achieves …
Language is not all you need: Aligning perception with language models
A big convergence of language, multimodal perception, action, and world modeling is a key
step toward artificial general intelligence. In this work, we introduce KOSMOS-1, a …
MIMIC-IT: Multi-modal in-context instruction tuning
High-quality instructions and responses are essential for the zero-shot performance of large
language models on interactive natural language tasks. For interactive vision-language …
LLaMA-Adapter: Efficient fine-tuning of language models with zero-init attention
We present LLaMA-Adapter, a lightweight adaption method to efficiently fine-tune LLaMA
into an instruction-following model. Using 52K self-instruct demonstrations, LLaMA-Adapter …