Emu3: Next-token prediction is all you need

X Wang, X Zhang, Z Luo, Q Sun, Y Cui, J Wang… - arxiv preprint arxiv …, 2024 - arxiv.org
While next-token prediction is considered a promising path towards artificial general
intelligence, it has struggled to excel in multimodal tasks, which are still dominated by …

Show-o: One single transformer to unify multimodal understanding and generation

J Xie, W Mao, Z Bai, DJ Zhang, W Wang, KQ Lin… - arxiv preprint arxiv …, 2024 - arxiv.org
We present a unified transformer, i.e., Show-o, that unifies multimodal understanding and
generation. Unlike fully autoregressive models, Show-o unifies autoregressive and …

Lumina-mGPT: Illuminate flexible photorealistic text-to-image generation with multimodal generative pretraining

D Liu, S Zhao, L Zhuo, W Lin, Y Qiao, H Li… - arxiv preprint arxiv …, 2024 - arxiv.org
We present Lumina-mGPT, a family of multimodal autoregressive models capable of various
vision and language tasks, particularly excelling in generating flexible photorealistic images …

Janus: Decoupling visual encoding for unified multimodal understanding and generation

C Wu, X Chen, Z Wu, Y Ma, X Liu, Z Pan, W Liu… - arxiv preprint arxiv …, 2024 - arxiv.org
In this paper, we introduce Janus, an autoregressive framework that unifies multimodal
understanding and generation. Prior research often relies on a single visual encoder for …

Loong: Generating minute-level long videos with autoregressive language models

Y Wang, T Xiong, D Zhou, Z Lin, Y Zhao, B Kang… - arxiv preprint arxiv …, 2024 - arxiv.org
It is desirable but challenging to generate content-rich long videos in the scale of minutes.
Autoregressive large language models (LLMs) have achieved great success in generating …

Open-MAGVIT2: An open-source project toward democratizing auto-regressive visual generation

Z Luo, F Shi, Y Ge, Y Yang, L Wang, Y Shan - arxiv preprint arxiv …, 2024 - arxiv.org
We present Open-MAGVIT2, a family of auto-regressive image generation models ranging
from 300M to 1.5B. The Open-MAGVIT2 project produces an open-source replication of …

MaskBit: Embedding-free image generation via bit tokens

M Weber, L Yu, Q Yu, X Deng, X Shen… - arxiv preprint arxiv …, 2024 - arxiv.org
Masked transformer models for class-conditional image generation have become a
compelling alternative to diffusion models. Typically comprising two stages - an initial VQGAN …

Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey

L Chen, Z Wang, S Ren, L Li, H Zhao, Y Li… - arxiv preprint arxiv …, 2024 - arxiv.org
Building on the foundations of language modeling in natural language processing, Next
Token Prediction (NTP) has evolved into a versatile training objective for machine learning …

DART: Denoising autoregressive transformer for scalable text-to-image generation

J Gu, Y Wang, Y Zhang, Q Zhang, D Zhang… - arxiv preprint arxiv …, 2024 - arxiv.org
Diffusion models have become the dominant approach for visual generation. They are
trained by denoising a Markovian process which gradually adds noise to the input. We …

Randomized autoregressive visual generation

Q Yu, J He, X Deng, X Shen, LC Chen - arxiv preprint arxiv:2411.00776, 2024 - arxiv.org
This paper presents Randomized AutoRegressive modeling (RAR) for visual generation,
which sets a new state-of-the-art performance on the image generation task while …