PrefixKV: Adaptive Prefix KV Cache is What Vision Instruction-Following Models Need for Efficient Generation

A Wang, H Chen, J Tan, K Zhang, X Cai, Z Lin… - arxiv preprint arxiv …, 2024 - arxiv.org
Recently, large vision-language models (LVLMs) have rapidly gained popularity for their
strong generation and reasoning capabilities given diverse multimodal inputs. However …

A Survey on Inference Optimization Techniques for Mixture of Experts Models

J Liu, P Tang, W Wang, Y Ren, X Hou, PA Heng… - arxiv preprint arxiv …, 2024 - arxiv.org
The emergence of large-scale Mixture of Experts (MoE) models has marked a significant
advancement in artificial intelligence, offering enhanced model capacity and computational …

LLaVA-KD: A Framework of Distilling Multimodal Large Language Models

Y Cai, J Zhang, H He, X He, A Tong, Z Gan… - arxiv preprint arxiv …, 2024 - arxiv.org
The success of Large Language Models (LLM) has led researchers to explore Multimodal
Large Language Models (MLLM) for unified visual and linguistic understanding. However …

Advancing Fine-Grained Visual Understanding with Multi-Scale Alignment in Multi-Modal Models

W Wang, Z Li, Q Xu, L Li, YQ Cai, B Jiang… - arxiv preprint arxiv …, 2024 - arxiv.org
Multi-modal large language models (MLLMs) have achieved remarkable success in fine-
grained visual understanding across a range of tasks. However, they often encounter …

VLsI: Verbalized Layers-to-Interactions from Large to Small Vision Language Models

BK Lee, R Hachiuma, YCF Wang, YM Ro… - arxiv preprint arxiv …, 2024 - arxiv.org
The recent surge in high-quality visual instruction tuning samples from closed-source vision-
language models (VLMs) such as GPT-4V has accelerated the release of open-source …

MoVE-KD: Knowledge Distillation for VLMs with Mixture of Visual Encoders

J Cao, Y Zhang, T Huang, M Lu, Q Zhang, R An… - arxiv preprint arxiv …, 2025 - arxiv.org
Visual encoders are fundamental components in vision-language models (VLMs), each
showcasing unique strengths derived from various pre-trained visual foundation models. To …

Align-KD: Distilling Cross-Modal Alignment Knowledge for Mobile Vision-Language Model

Q Feng, W Li, T Lin, X Chen - arxiv preprint arxiv:2412.01282, 2024 - arxiv.org
Vision-Language Models (VLMs) bring powerful understanding and reasoning capabilities
to multimodal tasks. Meanwhile, the great need for capable artificial intelligence on mobile …

A Framework of Distilling Multimodal Large Language Models

J Zhang, H He, X He, A Tong, Z Gan, C Wang, X Bai - openreview.net
The success of Large Language Models (LLM) has led researchers to explore Multimodal
Large Language Models (MLLM) for unified visual and linguistic understanding. However …

Learning to Inference Adaptively for Multimodal Large Language Models

Z Xu, KD Nguyen, P Mukherjee, S Chaterji, S Bagchi… - pages.cs.wisc.edu
Multimodal Large Language Models (MLLMs) have shown impressive capabilities
in reasoning, yet come with substantial computational cost, limiting their deployment in …

Towards Better Adaptation of Foundation Models

Z Xu - pages.cs.wisc.edu
Foundation models have revolutionized artificial intelligence, yet fundamental challenges
remain in understanding and optimizing their capabilities in adaptation and inference. This …