Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey

L Chen, Z Wang, S Ren, L Li, H Zhao, Y Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Building on the foundations of language modeling in natural language processing, Next
Token Prediction (NTP) has evolved into a versatile training objective for machine learning …

Imagefolder: Autoregressive image generation with folded tokens

X Li, K Qiu, H Chen, J Kuen, J Gu, B Raj… - arXiv preprint arXiv …, 2024 - arxiv.org
Image tokenizers are crucial for visual generative models, e.g., diffusion models (DMs) and
autoregressive (AR) models, as they construct the latent representation for modeling …

Randomized autoregressive visual generation

Q Yu, J He, X Deng, X Shen, LC Chen - arXiv preprint arXiv:2411.00776, 2024 - arxiv.org
This paper presents Randomized AutoRegressive modeling (RAR) for visual generation,
which sets a new state-of-the-art performance on the image generation task while …

Taming scalable visual tokenizer for autoregressive image generation

F Shi, Z Luo, Y Ge, Y Yang, Y Shan, L Wang - arXiv preprint arXiv …, 2024 - arxiv.org
Existing vector quantization (VQ) methods struggle with scalability, largely attributed to the
instability of the codebook that undergoes partial updates during training. The codebook is …

Language-Guided Image Tokenization for Generation

K Zha, L Yu, A Fathi, DA Ross, C Schmid… - arXiv preprint arXiv …, 2024 - arxiv.org
Image tokenization, the process of transforming raw image pixels into a compact low-
dimensional latent representation, has proven crucial for scalable and efficient image …

Towards Unifying Understanding and Generation in the Era of Vision Foundation Models: A Survey from the Autoregression Perspective

S **e, W Zu, M Zhao, D Su, S Liu, R Shi, G Li… - arxiv preprint arxiv …, 2024 - arxiv.org
Autoregression in large language models (LLMs) has shown impressive scalability by
unifying all language tasks into the next token prediction paradigm. Recently, there has been a …

Democratizing Text-to-Image Masked Generative Models with Compact Text-Aware One-Dimensional Tokens

D Kim, J He, Q Yu, C Yang, X Shen, S Kwak… - arXiv preprint arXiv …, 2025 - arxiv.org
Image tokenizers form the foundation of modern text-to-image generative models but are
notoriously difficult to train. Furthermore, most existing text-to-image models rely on large …

RandAR: Decoder-only Autoregressive Visual Generation in Random Orders

Z Pang, T Zhang, F Luan, Y Man, H Tan… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce RandAR, a decoder-only visual autoregressive (AR) model capable of
generating images in arbitrary token orders. Unlike previous decoder-only AR models that …

FlowAR: Scale-wise Autoregressive Image Generation Meets Flow Matching

S Ren, Q Yu, J He, X Shen, A Yuille… - arXiv preprint arXiv …, 2024 - arxiv.org
Autoregressive (AR) modeling has achieved remarkable success in natural language
processing by enabling models to generate text with coherence and contextual …

Dissecting Bit-Level Scaling Laws in Quantizing Vision Generative Models

X Ding, S Cao, T Cao, Z Chen - arXiv preprint arXiv:2501.06218, 2025 - arxiv.org
Vision generative models have recently made significant advancements along two primary
paradigms: diffusion-style and language-style, both of which have demonstrated excellent …