Emu3: Next-token prediction is all you need

X Wang, X Zhang, Z Luo, Q Sun, Y Cui, J Wang… - arxiv preprint arxiv …, 2024 - arxiv.org
While next-token prediction is considered a promising path towards artificial general
intelligence, it has struggled to excel in multimodal tasks, which are still dominated by …

Lumina-mGPT: Illuminate flexible photorealistic text-to-image generation with multimodal generative pretraining

D Liu, S Zhao, L Zhuo, W Lin, Y Qiao, H Li… - arxiv preprint arxiv …, 2024 - arxiv.org
We present Lumina-mGPT, a family of multimodal autoregressive models capable of various
vision and language tasks, particularly excelling in generating flexible photorealistic images …

Unidream: Unifying diffusion priors for relightable text-to-3d generation

Z Liu, Y Li, Y Lin, X Yu, S Peng, YP Cao, X Qi… - … on Computer Vision, 2024 - Springer
Recent advancements in text-to-3D generation technology have significantly advanced the
conversion of textual descriptions into imaginative well-geometrical and finely textured 3D …

Sana: Efficient high-resolution image synthesis with linear diffusion transformers

E Xie, J Chen, J Chen, H Cai, H Tang, Y Lin… - arxiv preprint arxiv …, 2024 - arxiv.org
We introduce Sana, a text-to-image framework that can efficiently generate images up to
4096×4096 resolution. Sana can synthesize high-resolution, high-quality images …

Janus-Pro: Unified multimodal understanding and generation with data and model scaling

X Chen, Z Wu, X Liu, Z Pan, W Liu, Z Xie, X Yu… - arxiv preprint arxiv …, 2025 - arxiv.org
In this work, we introduce Janus-Pro, an advanced version of the previous work Janus.
Specifically, Janus-Pro incorporates (1) an optimized training strategy,(2) expanded training …

PixWizard: Versatile image-to-image visual assistant with open-language instructions

W Lin, X Wei, R Zhang, L Zhuo, S Zhao… - arxiv preprint arxiv …, 2024 - arxiv.org
This paper presents a versatile image-to-image visual assistant, PixWizard, designed for
image generation, manipulation, and translation based on free-form language instructions …

JanusFlow: Harmonizing autoregression and rectified flow for unified multimodal understanding and generation

Y Ma, X Liu, X Chen, W Liu, C Wu, Z Wu, Z Pan… - arxiv preprint arxiv …, 2024 - arxiv.org
We present JanusFlow, a powerful framework that unifies image understanding and
generation in a single model. JanusFlow introduces a minimalist architecture that integrates …

Customize your visual autoregressive recipe with set autoregressive modeling

W Liu, L Zhuo, Y Xin, S Xia, P Gao, X Yue - arxiv preprint arxiv …, 2024 - arxiv.org
We introduce a new paradigm for AutoRegressive (AR) image generation, termed Set
AutoRegressive Modeling (SAR). SAR generalizes the conventional AR to the next-set …

SANA: Efficient High-Resolution Text-to-Image Synthesis with Linear Diffusion Transformers

E Xie, J Chen, J Chen, H Cai, H Tang, Y Lin… - The Thirteenth …, 2025 - openreview.net
We introduce Sana, a text-to-image framework that can efficiently generate images up to
4096×4096 resolution. Sana can synthesize high-resolution, high-quality images …

SnapGen: Taming High-Resolution Text-to-Image Models for Mobile Devices with Efficient Architectures and Training

D Hu, J Chen, X Huang, H Coskun, A Sahni… - arxiv preprint arxiv …, 2024 - arxiv.org
Existing text-to-image (T2I) diffusion models face several limitations, including large model
sizes, slow runtime, and low-quality generation on mobile devices. This paper aims to …