Google Scholar

Meteor: Mamba-based traversal of rationale for large language and vision models

BK Lee, CW Kim, B Park, YM Ro - Advances in Neural …, 2025 - proceedings.neurips.cc

The rapid development of large language and vision models (LLVMs) has been driven by
advances in visual instruction tuning. Recently, open-source LLVMs have curated high …‏

Cited by 18 · All 6 versions

[PDF] arxiv.org

Efficient multimodal large language models: A survey‏

Y Jin, J Li, Y Liu, T Gu, K Wu, Z Jiang, M He… - arXiv preprint arXiv …, 2024 - arxiv.org

In the past year, Multimodal Large Language Models (MLLMs) have demonstrated
remarkable performance in tasks such as visual question answering, visual understanding …‏

Cited by 44 · All 2 versions

[PDF] arxiv.org

Eagle: Exploring the design space for multimodal LLMs with mixture of encoders

M Shi, F Liu, S Wang, S Liao, S Radhakrishnan… - arXiv preprint arXiv …, 2024 - arxiv.org

The ability to accurately interpret complex visual information is a crucial topic of multimodal
large language models (MLLMs). Recent work indicates that enhanced visual perception …‏

Cited by 47 · All 4 versions

[PDF] arxiv.org

Learning visual prompts for guiding the attention of vision transformers‏

R Rezaei, MJ Sabet, J Gu, D Rueckert, P Torr… - arXiv preprint arXiv …, 2024 - arxiv.org

Visual prompting infuses visual information into the input image to adapt models toward
specific predictions and tasks. Recently, manually crafted markers such as red circles are …‏

Cited by 6 · All 3 versions

[PDF] arxiv.org

MetaMorph: Multimodal understanding and generation via instruction tuning

S Tong, D Fan, J Zhu, Y Xiong, X Chen, K Sinha… - arXiv preprint arXiv …, 2024 - arxiv.org

In this work, we propose Visual-Predictive Instruction Tuning (VPiT), a simple and effective
extension to visual instruction tuning that enables a pretrained LLM to quickly morph into an …‏

Cited by 6 · All 2 versions

[PDF] arxiv.org

Diffusion feedback helps CLIP see better

W Wang, Q Sun, F Zhang, Y Tang, J Liu… - arXiv preprint arXiv …, 2024 - arxiv.org

Contrastive Language-Image Pre-training (CLIP), which excels at abstracting open-world
representations across domains and modalities, has become a foundation for a variety of …‏

Cited by 11 · All 3 versions

[PDF] arxiv.org

TroL: Traversal of layers for large language and vision models

BK Lee, S Chung, CW Kim, B Park, YM Ro - arXiv preprint arXiv …, 2024 - arxiv.org

Large language and vision models (LLVMs) have been driven by the generalization power
of large language models (LLMs) and the advent of visual instruction tuning. Along with …‏

Cited by 5 · All 7 versions

[PDF] arxiv.org

Phantom of latent for large language and vision models‏

BK Lee, S Chung, CW Kim, B Park, YM Ro - arXiv preprint arXiv …, 2024 - arxiv.org

The success of visual instruction tuning has accelerated the development of large language
and vision models (LLVMs). Following the scaling laws of instruction-tuned large language …‏

Cited by 5 · All 3 versions

[PDF] arxiv.org

PaliGemma 2: A family of versatile VLMs for transfer

A Steiner, AS Pinto, M Tschannen, D Keysers… - arXiv preprint arXiv …, 2024 - arxiv.org

PaliGemma 2 is an upgrade of the PaliGemma open Vision-Language Model (VLM) based
on the Gemma 2 family of language models. We combine the SigLIP-So400m vision …‏

Cited by 5 · All 2 versions

[PDF] arxiv.org

On Erroneous Agreements of CLIP Image Embeddings‏

S Li, PW Koh, SS Du - arXiv preprint arXiv:2411.05195, 2024 - arxiv.org

Recent research suggests that the failures of Vision-Language Models (VLMs) at visual
reasoning often stem from erroneous agreements, when semantically distinct images are …

Cited by 2 · All 2 versions

BRAVE: Broadening the visual encoding of vision-language models

Meteor: Mamba-based traversal of rationale for large language and vision models‏