DeepSeek-VL2: Mixture-of-Experts vision-language models for advanced multimodal understanding

Z Wu, X Chen, Z Pan, X Liu, W Liu, D Dai… - arXiv preprint arXiv …, 2024 - arxiv.org
We present DeepSeek-VL2, an advanced series of large Mixture-of-Experts (MoE) Vision-
Language Models that significantly improves upon its predecessor, DeepSeek-VL, through …

WisdoM: Improving multimodal sentiment analysis by fusing contextual world knowledge

W Wang, L Ding, L Shen, Y Luo, H Hu… - Proceedings of the 32nd …, 2024 - dl.acm.org
Multimodal Sentiment Analysis (MSA) focuses on leveraging multimodal signals for
understanding human sentiment. Most of the existing works rely on superficial information …

FIRE: A dataset for feedback integration and refinement evaluation of multimodal models

P Li, Z Gao, B Zhang, T Yuan, Y Wu, M Harandi… - arXiv preprint arXiv …, 2024 - arxiv.org
Vision language models (VLMs) have achieved impressive progress in diverse applications,
becoming a prevalent research direction. In this paper, we build FIRE, a feedback …

MMDocBench: Benchmarking Large Vision-Language Models for Fine-Grained Visual Document Understanding

F Zhu, Z Liu, XY Ng, H Wu, W Wang, F Feng… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Vision-Language Models (LVLMs) have achieved remarkable performance in many
vision-language tasks, yet their capabilities in fine-grained visual understanding remain …

VisionArena: 230K Real World User-VLM Conversations with Preference Labels

C Chou, L Dunlap, K Mashita, K Mandal… - arXiv preprint arXiv …, 2024 - arxiv.org
With the growing adoption and capabilities of vision-language models (VLMs) comes the
need for benchmarks that capture authentic user-VLM interactions. In response, we create …

Enhancing Perception Capabilities of Multimodal LLMs with Training-free Fusion

Z Chen, J Hu, Z Deng, Y Wang, B Zhuang… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal LLMs (MLLMs) equip language models with visual capabilities by aligning vision
encoders with language models. Existing methods to enhance the visual perception of …

Hummingbird: High Fidelity Image Generation via Multimodal Context Alignment

MQ Le, G Mittal, T Meng, ASM Iftekhar… - arXiv preprint arXiv …, 2025 - arxiv.org
While diffusion models are powerful in generating high-quality, diverse synthetic data for
object-centric tasks, existing methods struggle with scene-aware tasks such as Visual …

HumanVLM: Foundation for Human-Scene Vision-Language Model

D Dai, X Long, L Yutang, Z Yuanhui, S Xia - arXiv preprint arXiv …, 2024 - arxiv.org
Human-scene vision-language tasks are increasingly prevalent in diverse social
applications, yet recent advancements predominantly rely on models specifically tailored to …

How Does Vision-Language Adaptation Impact the Safety of Vision Language Models?

S Lee, G Kim, J Kim, H Lee, H Chang, SH Park… - arXiv preprint arXiv …, 2024 - arxiv.org
Vision-Language adaptation (VL adaptation) transforms Large Language Models (LLMs)
into Large Vision-Language Models (LVLMs) for multimodal tasks, but this process often …

Lost in Time: Clock and Calendar Understanding Challenges in Multimodal LLMs

R Saxena, AP Gema, P Minervini - arXiv preprint arXiv:2502.05092, 2025 - arxiv.org
Understanding time from visual representations is a fundamental cognitive skill, yet it
remains a challenge for multimodal large language models (MLLMs). In this work, we …