Emu3: Next-token prediction is all you need
While next-token prediction is considered a promising path towards artificial general
intelligence, it has struggled to excel in multimodal tasks, which are still dominated by …
Show-o: One single transformer to unify multimodal understanding and generation
We present a unified transformer, i.e., Show-o, that unifies multimodal understanding and
generation. Unlike fully autoregressive models, Show-o unifies autoregressive and …
Lumina-mGPT: Illuminate flexible photorealistic text-to-image generation with multimodal generative pretraining
We present Lumina-mGPT, a family of multimodal autoregressive models capable of various
vision and language tasks, particularly excelling in generating flexible photorealistic images …
Janus: Decoupling visual encoding for unified multimodal understanding and generation
In this paper, we introduce Janus, an autoregressive framework that unifies multimodal
understanding and generation. Prior research often relies on a single visual encoder for …
Loong: Generating minute-level long videos with autoregressive language models
It is desirable but challenging to generate content-rich long videos in the scale of minutes.
Autoregressive large language models (LLMs) have achieved great success in generating …
Open-MAGVIT2: An open-source project toward democratizing auto-regressive visual generation
We present Open-MAGVIT2, a family of auto-regressive image generation models ranging
from 300M to 1.5B. The Open-MAGVIT2 project produces an open-source replication of …
MaskBit: Embedding-free image generation via bit tokens
Masked transformer models for class-conditional image generation have become a
compelling alternative to diffusion models. Typically comprising two stages, an initial VQGAN …
Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey
Building on the foundations of language modeling in natural language processing, Next
Token Prediction (NTP) has evolved into a versatile training objective for machine learning …
DART: Denoising autoregressive transformer for scalable text-to-image generation
Diffusion models have become the dominant approach for visual generation. They are
trained by denoising a Markovian process which gradually adds noise to the input. We …
Randomized autoregressive visual generation
This paper presents Randomized AutoRegressive modeling (RAR) for visual generation,
which sets a new state-of-the-art performance on the image generation task while …