GIVT: Generative infinite-vocabulary transformers
Abstract We introduce Generative Infinite-Vocabulary Transformers (GIVT), which generate
vector sequences with real-valued entries, instead of discrete tokens from a finite …
Lumina-mGPT: Illuminate flexible photorealistic text-to-image generation with multimodal generative pretraining
We present Lumina-mGPT, a family of multimodal autoregressive models capable of various
vision and language tasks, particularly excelling in generating flexible photorealistic images …
MaskBit: Embedding-free image generation via bit tokens
Masked transformer models for class-conditional image generation have become a
compelling alternative to diffusion models. Typically comprising two stages: an initial VQGAN …
EMOVA: Empowering language models to see, hear and speak with vivid emotions
GPT-4o, an omni-modal model that enables vocal conversations with diverse emotions and
tones, marks a milestone for omni-modal foundation models. However, empowering Large …
Visual autoregressive modeling: Scalable image generation via next-scale prediction
We present Visual AutoRegressive modeling (VAR), a new generation paradigm that
redefines the autoregressive learning on images as coarse-to-fine "next-scale prediction" or "…
MaskGCT: Zero-shot text-to-speech with masked generative codec transformer
Recent large-scale text-to-speech (TTS) systems are usually grouped into autoregressive
and non-autoregressive systems. The autoregressive systems implicitly model duration but …
AdaNAT: Exploring adaptive policy for token-based image generation
Recent studies have demonstrated the effectiveness of token-based methods for visual
content generation. As a representative work, non-autoregressive Transformers (NATs) are …
WavChat: A survey of spoken dialogue models
Recent advancements in spoken dialogue models, exemplified by systems like GPT-4o,
have captured significant attention in the speech domain. Compared to traditional three-tier …
QueST: Self-supervised skill abstractions for learning continuous control
Generalization capabilities, or rather a lack thereof, are one of the most important unsolved
problems in the field of robot learning, and while several large scale efforts have set out to …
Vector Quantization for Recommender Systems: A Review and Outlook
Vector quantization, renowned for its unparalleled feature compression capabilities, has
been a prominent topic in signal processing and machine learning research for several …