Janus: Decoupling visual encoding for unified multimodal understanding and generation

C Wu, X Chen, Z Wu, Y Ma, X Liu, Z Pan, W Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
In this paper, we introduce Janus, an autoregressive framework that unifies multimodal
understanding and generation. Prior research often relies on a single visual encoder for …

Benchmark evaluations, applications, and challenges of large vision language models: A survey

Z Li, X Wu, H Du, H Nghiem, G Shi - arXiv preprint arXiv:2501.02189, 2025 - arxiv.org
Multimodal Vision Language Models (VLMs) have emerged as a transformative technology
at the intersection of computer vision and natural language processing, enabling machines …

Mini-InternVL: a flexible-transfer pocket multi-modal model with 5% parameters and 90% performance

Z Gao, Z Chen, E Cui, Y Ren, W Wang, J Zhu, H Tian… - Visual Intelligence, 2024 - Springer
Multi-modal large language models (MLLMs) have demonstrated impressive performance in
vision-language tasks across a wide range of domains. However, the large model scale and …

Enhancing the reasoning ability of multimodal large language models via mixed preference optimization

W Wang, Z Chen, W Wang, Y Cao, Y Liu, Z Gao… - arXiv preprint arXiv …, 2024 - arxiv.org
Existing open-source multimodal large language models (MLLMs) generally follow a
training process involving pre-training and supervised fine-tuning. However, these models …

Infinity: Scaling bitwise autoregressive modeling for high-resolution image synthesis

J Han, J Liu, Y Jiang, B Yan, Y Zhang, Z Yuan… - arXiv preprint arXiv …, 2024 - arxiv.org
We present Infinity, a Bitwise Visual AutoRegressive Modeling framework capable of generating high-
resolution, photorealistic images following language instructions. Infinity redefines visual …

In-context LoRA for diffusion transformers

L Huang, W Wang, ZF Wu, Y Shi, H Dou… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent research (arXiv:2410.15027) has explored the use of diffusion transformers (DiTs) for
task-agnostic image generation by simply concatenating attention tokens across images …

TokenFlow: Unified image tokenizer for multimodal understanding and generation

L Qu, H Zhang, Y Liu, X Wang, Y Jiang, Y Gao… - arXiv preprint arXiv …, 2024 - arxiv.org
We present TokenFlow, a novel unified image tokenizer that bridges the long-standing gap
between multimodal understanding and generation. Prior research attempts to employ a …

MetaMorph: Multimodal understanding and generation via instruction tuning

S Tong, D Fan, J Zhu, Y Xiong, X Chen, K Sinha… - arXiv preprint arXiv …, 2024 - arxiv.org
In this work, we propose Visual-Predictive Instruction Tuning (VPiT), a simple and effective
extension to visual instruction tuning that enables a pretrained LLM to quickly morph into an …

Janus-pro: Unified multimodal understanding and generation with data and model scaling

X Chen, Z Wu, X Liu, Z Pan, W Liu, Z Xie, X Yu… - arXiv preprint arXiv …, 2025 - arxiv.org
In this work, we introduce Janus-Pro, an advanced version of the previous work Janus.
Specifically, Janus-Pro incorporates (1) an optimized training strategy, (2) expanded training …

JanusFlow: Harmonizing autoregression and rectified flow for unified multimodal understanding and generation

Y Ma, X Liu, X Chen, W Liu, C Wu, Z Wu, Z Pan… - arXiv preprint arXiv …, 2024 - arxiv.org
We present JanusFlow, a powerful framework that unifies image understanding and
generation in a single model. JanusFlow introduces a minimalist architecture that integrates …