Emu3: Next-token prediction is all you need

X Wang, X Zhang, Z Luo, Q Sun, Y Cui, J Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
While next-token prediction is considered a promising path towards artificial general
intelligence, it has struggled to excel in multimodal tasks, which are still dominated by …

Janus: Decoupling visual encoding for unified multimodal understanding and generation

C Wu, X Chen, Z Wu, Y Ma, X Liu, Z Pan, W Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
In this paper, we introduce Janus, an autoregressive framework that unifies multimodal
understanding and generation. Prior research often relies on a single visual encoder for …

Deepseek-vl2: Mixture-of-experts vision-language models for advanced multimodal understanding

Z Wu, X Chen, Z Pan, X Liu, W Liu, D Dai… - arXiv preprint arXiv …, 2024 - arxiv.org
We present DeepSeek-VL2, an advanced series of large Mixture-of-Experts (MoE) Vision-
Language Models that significantly improves upon its predecessor, DeepSeek-VL, through …

A survey on multimodal benchmarks: In the era of large AI models

L Li, G Chen, H Shi, J Xiao, L Chen - arXiv preprint arXiv:2409.18142, 2024 - arxiv.org
The rapid evolution of Multimodal Large Language Models (MLLMs) has brought substantial
advancements in artificial intelligence, significantly enhancing the capability to understand …

Janusflow: Harmonizing autoregression and rectified flow for unified multimodal understanding and generation

Y Ma, X Liu, X Chen, W Liu, C Wu, Z Wu, Z Pan… - arXiv preprint arXiv …, 2024 - arxiv.org
We present JanusFlow, a powerful framework that unifies image understanding and
generation in a single model. JanusFlow introduces a minimalist architecture that integrates …

Questioning, Answering, and Captioning for Zero-Shot Detailed Image Caption

DT Luu, VT Le, DM Vo - Proceedings of the Asian …, 2024 - openaccess.thecvf.com
End-to-end pre-trained large vision language models (VLMs) have made unprecedented
progress in image captioning. Nonetheless, they struggle to generate detailed captions …

VipAct: Visual-Perception Enhancement via Specialized VLM Agent Collaboration and Tool-use

Z Zhang, R Rossi, T Yu, F Dernoncourt… - arXiv preprint arXiv …, 2024 - arxiv.org
While vision-language models (VLMs) have demonstrated remarkable performance across
various tasks combining textual and visual information, they continue to struggle with fine …

Ocean-OCR: Towards General OCR Application via a Vision-Language Model

S Chen, X Guo, Y Li, T Zhang, M Lin, D Kuang… - arXiv preprint arXiv …, 2025 - arxiv.org
Multimodal large language models (MLLMs) have shown impressive capabilities across
various domains, excelling in processing and understanding information from multiple …

Baichuan-Omni-1.5 Technical Report

Y Li, J Liu, T Zhang, S Chen, T Li, Z Li, L Liu… - arXiv preprint arXiv …, 2025 - arxiv.org
We introduce Baichuan-Omni-1.5, an omni-modal model that not only has omni-modal
understanding capabilities but also provides end-to-end audio generation capabilities. To …

Eagle 2: Building Post-Training Data Strategies from Scratch for Frontier Vision-Language Models

Z Li, G Chen, S Liu, S Wang, V VS, Y Ji, S Lan… - arXiv preprint arXiv …, 2025 - arxiv.org
Recently, promising progress has been made by open-source vision-language models
(VLMs) in bringing their capabilities closer to those of proprietary frontier models. However …