A Survey of Multimodal Large Language Models

Z Liang, Y Xu, Y Hong, P Shang, Q Wang… - Proceedings of the 3rd …, 2024 - dl.acm.org
With the widespread application of the Transformer architecture in various modalities,
including vision, the technology of large language models is evolving from a single modality …

Instruction tuning for large language models: A survey

S Zhang, L Dong, X Li, S Zhang, X Sun, S Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
This paper surveys research works in the quickly advancing field of instruction tuning (IT),
which can also be referred to as supervised fine-tuning (SFT) …
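
The entry above defines instruction tuning as supervised fine-tuning (SFT) on instruction-response pairs. As a minimal sketch of the usual SFT loss (not code from the survey; the toy names and shapes are hypothetical), the prompt tokens are masked out of the cross-entropy so that only response tokens are supervised:

    # Minimal SFT loss sketch: supervise only the response tokens.
    # Toy example; a real pipeline uses a tokenizer and an actual LLM.
    import torch
    import torch.nn.functional as F

    IGNORE_INDEX = -100  # cross_entropy skips targets with this value

    def sft_labels(prompt_ids, response_ids):
        """Concatenate prompt and response; labels mask out the prompt."""
        input_ids = torch.cat([prompt_ids, response_ids])
        labels = torch.cat([torch.full_like(prompt_ids, IGNORE_INDEX),
                            response_ids])
        return input_ids, labels

    prompt = torch.tensor([1, 4, 2])    # "instruction" token ids (toy)
    response = torch.tensor([7, 3, 9])  # "response" token ids (toy)
    input_ids, labels = sft_labels(prompt, response)

    # Stand-in for the model's next-token logits over a 10-word vocab.
    logits = torch.randn(len(input_ids), 10)

    # Shift by one so position t predicts token t+1 (causal LM training);
    # prompt positions contribute nothing to the loss.
    loss = F.cross_entropy(logits[:-1], labels[1:],
                           ignore_index=IGNORE_INDEX)
    print(float(loss))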

CogVLM: Visual expert for pretrained language models

W Wang, Q Lv, W Yu, W Hong, J Qi… - Advances in …, 2025 - proceedings.neurips.cc
We introduce CogVLM, a powerful open-source visual language foundation model. Different
from the popular shallow alignment method, which maps image features into the …
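
The snippet contrasts CogVLM with the popular shallow-alignment baseline, in which features from a frozen vision encoder are simply projected into the language model's input embedding space. A minimal sketch of that baseline follows (illustrative dimensions; this is not CogVLM's visual-expert design, which instead adds trainable expert weights inside each transformer layer):

    # "Shallow alignment" sketch: a single linear projector maps frozen
    # image features into the LLM token-embedding space, and the result
    # is prepended to the text embeddings. Dimensions are illustrative.
    import torch
    import torch.nn as nn

    d_vision, d_model = 1024, 4096             # encoder and LLM widths
    projector = nn.Linear(d_vision, d_model)   # the entire alignment module

    image_feats = torch.randn(1, 256, d_vision)  # 256 patch features from a frozen encoder
    text_embeds = torch.randn(1, 32, d_model)    # embedded prompt tokens

    image_tokens = projector(image_feats)                       # (1, 256, 4096)
    llm_inputs = torch.cat([image_tokens, text_embeds], dim=1)  # (1, 288, 4096)
    print(llm_inputs.shape)  # fed to the LLM as an ordinary sequence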

Multimodal foundation models: From specialists to general-purpose assistants

C Li, Z Gan, Z Yang, J Yang, L Li… - … and Trends® in …, 2024 - nowpublishers.com
This paper presents a comprehensive survey of the taxonomy and evolution of multimodal
foundation models that demonstrate vision and vision-language capabilities, focusing on …

Generative multimodal models are in-context learners

Q Sun, Y Cui, X Zhang, F Zhang, Q Yu… - Proceedings of the …, 2024 - openaccess.thecvf.com
Humans can easily solve multimodal tasks in context with only a few demonstrations or
simple instructions, which current multimodal systems largely struggle to imitate. In this work …

MM1: methods, analysis and insights from multimodal LLM pre-training

B McKinzie, Z Gan, JP Fauconnier, S Dodge… - … on Computer Vision, 2024 - Springer
In this work, we discuss building performant Multimodal Large Language Models (MLLMs).
In particular, we study the importance of various architecture components and data choices …

GLaMM: Pixel grounding large multimodal model

H Rasheed, M Maaz, S Shaji… - Proceedings of the …, 2024 - openaccess.thecvf.com
Large Multimodal Models (LMMs) extend Large Language Models to the vision
domain. Initial LMMs used holistic images and text prompts to generate ungrounded textual …

Unified-IO 2: Scaling autoregressive multimodal models with vision, language, audio, and action

J Lu, C Clark, S Lee, Z Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
We present Unified-IO 2, a multimodal and multi-skill unified model capable of following
novel instructions. Unified-IO 2 can use text, images, audio, and/or videos as input and can …

Honeybee: Locality-enhanced projector for multimodal LLM

J Cha, W Kang, J Mun, B Roh - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
In Multimodal Large Language Models (MLLMs), a visual projector plays a crucial
role in bridging pre-trained vision encoders with LLMs, enabling profound visual …
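
The snippet describes the visual projector that bridges a pre-trained vision encoder and the LLM. One way to make such a projector locality-aware is to downsample the patch grid with a strided convolution, so that only neighboring patches are mixed while the token count shrinks. The sketch below illustrates that general idea under assumed dimensions; it is not Honeybee's actual projector:

    # Locality-preserving projector sketch: a strided conv halves the
    # patch grid (mixing only neighbors), then a linear layer maps to
    # the LLM width. Sizes are illustrative, not Honeybee's design.
    import torch
    import torch.nn as nn

    class ConvProjector(nn.Module):
        def __init__(self, d_vision=1024, d_model=4096, stride=2):
            super().__init__()
            self.pool = nn.Conv2d(d_vision, d_vision,
                                  kernel_size=stride, stride=stride)
            self.proj = nn.Linear(d_vision, d_model)

        def forward(self, feats):
            # feats: (batch, grid*grid, d_vision) patch features
            b, n, c = feats.shape
            g = int(n ** 0.5)
            x = feats.transpose(1, 2).reshape(b, c, g, g)  # to 2-D grid
            x = self.pool(x)            # (b, c, g/2, g/2): local mixing
            x = x.flatten(2).transpose(1, 2)               # back to tokens
            return self.proj(x)

    tokens = ConvProjector()(torch.randn(1, 576, 1024))  # 24x24 patch grid
    print(tokens.shape)  # torch.Size([1, 144, 4096]): 4x fewer tokens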

ViP-LLaVA: Making large multimodal models understand arbitrary visual prompts

M Cai, H Liu, SK Mustikovela… - Proceedings of the …, 2024 - openaccess.thecvf.com
While existing large vision-language multimodal models focus on whole-image
understanding, there is a prominent gap in achieving region-specific comprehension …