Emu3: Next-token prediction is all you need
While next-token prediction is considered a promising path towards artificial general
intelligence, it has struggled to excel in multimodal tasks, which are still dominated by …
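Emu3's recipe is exactly what the title says: tokenize everything into one discrete sequence and train with the standard next-token cross-entropy loss. Below is a minimal sketch of that objective in PyTorch; the shared-vocabulary layout, sizes, and the toy stand-in model are illustrative assumptions, not Emu3's actual components:

    import torch
    import torch.nn.functional as F

    # Assumed layout: text tokens and VQ image codes share one vocabulary,
    # with image codes offset past the text range (an illustrative choice).
    TEXT_VOCAB, IMAGE_CODES = 32000, 8192
    VOCAB = TEXT_VOCAB + IMAGE_CODES

    def next_token_loss(model, text_ids, image_codes):
        # One interleaved sequence: the model never sees modality boundaries,
        # it simply predicts token t+1 from tokens <= t.
        seq = torch.cat([text_ids, image_codes + TEXT_VOCAB], dim=1)
        logits = model(seq[:, :-1])
        return F.cross_entropy(logits.reshape(-1, VOCAB), seq[:, 1:].reshape(-1))

    # Toy stand-in for a causal transformer, just to make the sketch runnable.
    model = torch.nn.Sequential(torch.nn.Embedding(VOCAB, 64),
                                torch.nn.Linear(64, VOCAB))
    loss = next_token_loss(model,
                           torch.randint(0, TEXT_VOCAB, (2, 16)),
                           torch.randint(0, IMAGE_CODES, (2, 64)))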
Janus: Decoupling visual encoding for unified multimodal understanding and generation
In this paper, we introduce Janus, an autoregressive framework that unifies multimodal
understanding and generation. Prior research often relies on a single visual encoder for …
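The decoupling Janus argues for means giving understanding and generation each their own visual pathway while sharing one autoregressive trunk. A structural sketch under assumed modules; the patch encoder, code embedding, and dimensions are placeholders, not the paper's actual components:

    import torch
    import torch.nn as nn

    class DecoupledVisualLM(nn.Module):
        # Two independent visual pathways feeding one shared transformer trunk.
        def __init__(self, dim=256):
            super().__init__()
            self.understand_enc = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # semantic patches
            self.generate_embed = nn.Embedding(8192, dim)                       # discrete VQ codes
            layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
            self.trunk = nn.TransformerEncoder(layer, num_layers=2)

        def forward(self, image=None, codes=None):
            # Each task enters through its own encoder, avoiding the conflict
            # a single shared encoder creates between the two roles.
            if image is not None:
                x = self.understand_enc(image).flatten(2).transpose(1, 2)
            else:
                x = self.generate_embed(codes)
            return self.trunk(x)

    feats = DecoupledVisualLM()(image=torch.randn(1, 3, 64, 64))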
DeepSeek-VL2: Mixture-of-Experts vision-language models for advanced multimodal understanding
We present DeepSeek-VL2, an advanced series of large Mixture-of-Experts (MoE) Vision-
Language Models that significantly improves upon its predecessor, DeepSeek-VL, through …
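"Mixture-of-Experts" here means the dense feed-forward blocks are replaced by many small experts, with a router activating only a few per token. A generic top-2 routing layer as a sketch; the expert count, sizes, and loop-based dispatch are illustrative, not DeepSeek-VL2's configuration:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MoELayer(nn.Module):
        def __init__(self, dim=64, n_experts=4, k=2):
            super().__init__()
            self.router = nn.Linear(dim, n_experts)   # scores each expert per token
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
                for _ in range(n_experts))
            self.k = k

        def forward(self, x):                          # x: (tokens, dim)
            topw, topi = F.softmax(self.router(x), dim=-1).topk(self.k, dim=-1)
            out = torch.zeros_like(x)
            for slot in range(self.k):                 # only k experts run per token
                for e, expert in enumerate(self.experts):
                    hit = topi[:, slot] == e
                    if hit.any():
                        out[hit] += topw[hit, slot, None] * expert(x[hit])
            return out

    y = MoELayer()(torch.randn(10, 64))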
A survey on multimodal benchmarks: In the era of large AI models
The rapid evolution of Multimodal Large Language Models (MLLMs) has brought substantial
advancements in artificial intelligence, significantly enhancing the capability to understand …
JanusFlow: Harmonizing autoregression and rectified flow for unified multimodal understanding and generation
We present JanusFlow, a powerful framework that unifies image understanding and
generation in a single model. JanusFlow introduces a minimalist architecture that integrates …
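The rectified-flow side of JanusFlow trains a network to regress the constant velocity along straight noise-to-data paths, which the framework then pairs with autoregressive understanding. A generic rectified-flow loss follows; the tiny velocity network is a stand-in, not JanusFlow's architecture:

    import torch
    import torch.nn.functional as F

    def rectified_flow_loss(v_net, x1):
        # Straight-line path x_t = (1 - t) x0 + t x1 has velocity x1 - x0,
        # which the network learns to predict at a random time t.
        x0 = torch.randn_like(x1)                    # noise endpoint
        t = torch.rand(x1.shape[0], 1)               # per-sample t in [0, 1]
        x_t = (1 - t) * x0 + t * x1
        return F.mse_loss(v_net(x_t, t), x1 - x0)

    class TinyVelocity(torch.nn.Module):             # placeholder net conditioned on t
        def __init__(self, dim=8):
            super().__init__()
            self.net = torch.nn.Linear(dim + 1, dim)
        def forward(self, x, t):
            return self.net(torch.cat([x, t], dim=-1))

    loss = rectified_flow_loss(TinyVelocity(), torch.randn(4, 8))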
Questioning, Answering, and Captioning for Zero-Shot Detailed Image Caption
End-to-end pre-trained large vision language models (VLMs) have made unprecedented
progress in image captioning. Nonetheless, they struggle to generate detailed captions …
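The title names the pipeline: generate questions about the image, answer them, then fold the answers into one detailed caption. A hedged pseudocode sketch of that flow; the vlm(image, prompt) -> str interface and all prompts are assumptions, not the paper's implementation:

    def detailed_caption(vlm, image):
        # Stage 1: questioning - probe objects, attributes, and relations
        # that a single-pass caption tends to omit.
        qs = vlm(image, "List questions about the objects, attributes, and "
                        "relations visible in this image.").splitlines()
        # Stage 2: answering - the same zero-shot VLM answers each question.
        qa = [(q, vlm(image, q)) for q in qs if q.strip()]
        # Stage 3: captioning - synthesize the Q-A evidence into one caption.
        facts = " ".join(f"Q: {q} A: {a}" for q, a in qa)
        return vlm(image, f"Using these facts, write a detailed caption. {facts}")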
VipAct: Visual-Perception Enhancement via Specialized VLM Agent Collaboration and Tool-use
While vision-language models (VLMs) have demonstrated remarkable performance across
various tasks combining textual and visual information, they continue to struggle with fine …
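VipAct's pairing of "agent collaboration" with "tool-use" suggests an orchestrator VLM that delegates the fine-grained perception it struggles with to specialized tools. A minimal dispatch loop as a sketch; every identifier below is illustrative, none comes from the paper:

    # Stub tools for fine-grained skills a general VLM often misses.
    def count_objects(image): ...
    def read_text(image): ...

    TOOLS = {"count": count_objects, "ocr": read_text}

    def answer(orchestrator_vlm, image, question):
        # The orchestrator names a helpful tool (or none), then reasons
        # over the tool's evidence instead of raw pixels alone.
        choice = orchestrator_vlm(image, f"Pick one of {sorted(TOOLS)} or 'none'. "
                                         f"Question: {question}")
        evidence = TOOLS[choice](image) if choice in TOOLS else ""
        return orchestrator_vlm(image, f"{question}\nTool evidence: {evidence}")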
Ocean-OCR: Towards General OCR Application via a Vision-Language Model
S Chen, X Guo, Y Li, T Zhang, M Lin, D Kuang… - arXiv preprint, 2025
Multimodal large language models (MLLMs) have shown impressive capabilities across
various domains, excelling in processing and understanding information from multiple …
Baichuan-Omni-1.5 Technical Report
Y Li, J Liu, T Zhang, S Chen, T Li, Z Li, L Liu… - arXiv preprint, 2025
We introduce Baichuan-Omni-1.5, an omni-modal model that not only has omni-modal
understanding capabilities but also provides end-to-end audio generation capabilities. To …
Eagle 2: Building Post-Training Data Strategies from Scratch for Frontier Vision-Language Models
Recently, promising progress has been made by open-source vision-language models
(VLMs) in bringing their capabilities closer to those of proprietary frontier models. However …