Meteor: Mamba-based traversal of rationale for large language and vision models

BK Lee, CW Kim, B Park, YM Ro - Advances in Neural …, 2025 - proceedings.neurips.cc
The rapid development of large language and vision models (LLVMs) has been driven by
advances in visual instruction tuning. Recently, open-source LLVMs have curated high …

Efficient multimodal large language models: A survey

Y Jin, J Li, Y Liu, T Gu, K Wu, Z Jiang, M He… - arXiv preprint arXiv …, 2024 - arxiv.org
In the past year, Multimodal Large Language Models (MLLMs) have demonstrated
remarkable performance in tasks such as visual question answering, visual understanding …

Eagle: Exploring the design space for multimodal LLMs with mixture of encoders

M Shi, F Liu, S Wang, S Liao, S Radhakrishnan… - arXiv preprint arXiv …, 2024 - arxiv.org
The ability to accurately interpret complex visual information is a crucial topic of multimodal
large language models (MLLMs). Recent work indicates that enhanced visual perception …

Learning visual prompts for guiding the attention of vision transformers

R Rezaei, MJ Sabet, J Gu, D Rueckert, P Torr… - arXiv preprint arXiv …, 2024 - arxiv.org
Visual prompting infuses visual information into the input image to adapt models toward
specific predictions and tasks. Recently, manually crafted markers such as red circles are …

MetaMorph: Multimodal understanding and generation via instruction tuning

S Tong, D Fan, J Zhu, Y Xiong, X Chen, K Sinha… - arXiv preprint arXiv …, 2024 - arxiv.org
In this work, we propose Visual-Predictive Instruction Tuning (VPiT), a simple and effective
extension to visual instruction tuning that enables a pretrained LLM to quickly morph into an …

Diffusion feedback helps CLIP see better

W Wang, Q Sun, F Zhang, Y Tang, J Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
Contrastive Language-Image Pre-training (CLIP), which excels at abstracting open-world
representations across domains and modalities, has become a foundation for a variety of …

TroL: Traversal of layers for large language and vision models

BK Lee, S Chung, CW Kim, B Park, YM Ro - arXiv preprint arXiv …, 2024 - arxiv.org
Large language and vision models (LLVMs) have been driven by the generalization power
of large language models (LLMs) and the advent of visual instruction tuning. Along with …

Phantom of latent for large language and vision models

BK Lee, S Chung, CW Kim, B Park, YM Ro - arXiv preprint arXiv …, 2024 - arxiv.org
The success of visual instruction tuning has accelerated the development of large language
and vision models (LLVMs). Following the scaling laws of instruction-tuned large language …

PaliGemma 2: A family of versatile VLMs for transfer

A Steiner, AS Pinto, M Tschannen, D Keysers… - arXiv preprint arXiv …, 2024 - arxiv.org
PaliGemma 2 is an upgrade of the PaliGemma open Vision-Language Model (VLM) based
on the Gemma 2 family of language models. We combine the SigLIP-So400m vision …

On Erroneous Agreements of CLIP Image Embeddings

S Li, PW Koh, SS Du - arXiv preprint arXiv:2411.05195, 2024 - arxiv.org
Recent research suggests that the failures of Vision-Language Models (VLMs) at visual
reasoning often stem from erroneous agreements, when semantically distinct images are …