Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey
Building on the foundations of language modeling in natural language processing, Next Token Prediction (NTP) has evolved into a versatile training objective for machine learning …
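As a rough illustration of the next-token-prediction objective this survey builds on, the PyTorch sketch below trains a toy model to predict token t+1 from tokens up to t; the GRU stand-in, vocabulary size, and tensor shapes are illustrative assumptions, not details taken from the survey.

```python
# Minimal next-token-prediction (NTP) training step, sketched in PyTorch.
# The tiny GRU "model", vocabulary size, and shapes are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model = 1000, 64

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.rnn = nn.GRU(d_model, d_model, batch_first=True)  # stand-in for a transformer
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):                       # tokens: (batch, seq_len)
        h, _ = self.rnn(self.embed(tokens))
        return self.head(h)                          # logits: (batch, seq_len, vocab)

model = TinyLM()
tokens = torch.randint(0, vocab_size, (8, 32))       # toy token sequences

logits = model(tokens[:, :-1])                       # predict token t+1 from tokens <= t
targets = tokens[:, 1:]
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()    # the same objective applies whether tokens encode text or images
```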
Imagefolder: Autoregressive image generation with folded tokens
Image tokenizers are crucial for visual generative models, e.g., diffusion models (DMs) and autoregressive (AR) models, as they construct the latent representation for modeling …
Randomized autoregressive visual generation
This paper presents Randomized AutoRegressive modeling (RAR) for visual generation, which sets a new state-of-the-art performance on the image generation task while …
Taming scalable visual tokenizer for autoregressive image generation
Existing vector quantization (VQ) methods struggle with scalability, a problem largely attributed to the instability of the codebook, which undergoes partial updates during training. The codebook is …
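To make the codebook and its partial updates concrete, here is a generic VQ-VAE-style quantization step with a straight-through estimator; the codebook size, dimensions, and loss are standard textbook choices, not the stabilization method this paper proposes.

```python
# Generic VQ-VAE-style quantization step, illustrating the codebook the abstract
# discusses. Sizes and the straight-through trick are standard choices.
import torch
import torch.nn as nn

codebook_size, code_dim = 8192, 256
codebook = nn.Embedding(codebook_size, code_dim)

def quantize(z):                                          # z: (batch, num_tokens, code_dim)
    flat = z.reshape(-1, code_dim)
    dists = torch.cdist(flat, codebook.weight)            # (batch*tokens, codebook_size)
    indices = dists.argmin(dim=-1).reshape(z.shape[:-1])  # discrete token ids
    codes = codebook(indices)                             # selected codewords
    # codebook loss: only the selected codewords receive gradients,
    # i.e. the "partial updates" the abstract links to instability
    codebook_loss = (codes - z.detach()).pow(2).mean()
    # straight-through estimator so the encoder still receives gradients
    z_q = z + (codes - z).detach()
    return z_q, indices, codebook_loss

z = torch.randn(4, 16 * 16, code_dim, requires_grad=True)
z_q, ids, cb_loss = quantize(z)
cb_loss.backward()   # gradients touch only the codewords picked by argmin
```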
Language-Guided Image Tokenization for Generation
Image tokenization, the process of transforming raw image pixels into a compact low-dimensional latent representation, has proven crucial for scalable and efficient image …
Towards Unifying Understanding and Generation in the Era of Vision Foundation Models: A Survey from the Autoregression Perspective
Autoregression in large language models (LLMs) has shown impressive scalability by unifying all language tasks into the next token prediction paradigm. Recently, there has been a …
Democratizing Text-to-Image Masked Generative Models with Compact Text-Aware One-Dimensional Tokens
Image tokenizers form the foundation of modern text-to-image generative models but are notoriously difficult to train. Furthermore, most existing text-to-image models rely on large …
RandAR: Decoder-only Autoregressive Visual Generation in Random Orders
We introduce RandAR, a decoder-only visual autoregressive (AR) model capable of generating images in arbitrary token orders. Unlike previous decoder-only AR models that …
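Below is a minimal sketch of what random-order decoding could look like, assuming a hypothetical model interface that predicts the token at a queried position given all (position, token) pairs generated so far; it is not RandAR's actual architecture or API.

```python
# Sketch of random-order autoregressive decoding. The dummy model is a
# placeholder for a decoder conditioned on previously generated (position,
# token) pairs and the queried position; it is not RandAR's interface.
import torch

vocab_size, num_tokens = 1024, 256

def dummy_model(positions, tokens, target_pos):
    return torch.zeros(vocab_size)        # uniform logits, placeholder only

def random_order_decode(model, num_tokens):
    order = torch.randperm(num_tokens)    # arbitrary generation order
    positions, tokens = [], []
    for pos in order.tolist():
        logits = model(positions, tokens, target_pos=pos)
        next_token = torch.distributions.Categorical(logits=logits).sample()
        positions.append(pos)
        tokens.append(int(next_token))
    # scatter back to raster order for the image detokenizer
    canvas = torch.zeros(num_tokens, dtype=torch.long)
    canvas[torch.tensor(positions)] = torch.tensor(tokens)
    return canvas

image_tokens = random_order_decode(dummy_model, num_tokens)
```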
FlowAR: Scale-wise Autoregressive Image Generation Meets Flow Matching
Autoregressive (AR) modeling has achieved remarkable success in natural language processing by enabling models to generate text with coherence and contextual …
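For the flow-matching side of the title, the sketch below implements a standard rectified-flow-style training loss on toy vectors; the velocity network and shapes are placeholders, and FlowAR's scale-wise autoregressive conditioning is deliberately omitted.

```python
# Minimal flow-matching training step (rectified-flow style), to make the
# "flow matching" half of the title concrete. Shapes are placeholders.
import torch
import torch.nn as nn

velocity_net = nn.Sequential(nn.Linear(16 + 1, 128), nn.SiLU(), nn.Linear(128, 16))

def flow_matching_loss(x1):                    # x1: (batch, 16) data samples
    x0 = torch.randn_like(x1)                  # noise endpoint
    t = torch.rand(x1.size(0), 1)              # random time in [0, 1]
    xt = (1 - t) * x0 + t * x1                 # linear interpolation path
    target_v = x1 - x0                         # constant target velocity along the path
    pred_v = velocity_net(torch.cat([xt, t], dim=-1))
    return (pred_v - target_v).pow(2).mean()

loss = flow_matching_loss(torch.randn(32, 16))
loss.backward()
```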
Dissecting Bit-Level Scaling Laws in Quantizing Vision Generative Models
X Ding, S Cao, T Cao, Z Chen. arXiv preprint arXiv:2501.06218, 2025.
Vision generative models have recently made significant advancements along two primary paradigms: diffusion-style and language-style, both of which have demonstrated excellent …
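To make "bit-level" concrete, the snippet below applies generic symmetric uniform quantization to a weight tensor at several bit-widths and reports the reconstruction error; this only illustrates the bit-width versus fidelity trade-off and is not the paper's measurement protocol.

```python
# Generic symmetric uniform quantization of a weight tensor to b bits.
import torch

def quantize_weights(w, bits):
    qmax = 2 ** (bits - 1) - 1                 # e.g. 127 for 8-bit signed
    scale = w.abs().max() / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q * scale                           # dequantized weights

w = torch.randn(256, 256)
for b in (8, 4, 2):
    err = (quantize_weights(w, b) - w).pow(2).mean()
    print(f"{b}-bit quantization MSE: {err.item():.6f}")  # error grows as bit-width shrinks
```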