GIVT: Generative infinite-vocabulary transformers

M Tschannen, C Eastwood, F Mentzer - European Conference on …, 2024 - Springer
Abstract We introduce Generative Infinite-Vocabulary Transformers (GIVT) which generate
vector sequences with real-valued entries, instead of discrete tokens from a finite …

Lumina-mGPT: Illuminate flexible photorealistic text-to-image generation with multimodal generative pretraining

D Liu, S Zhao, L Zhuo, W Lin, Y Qiao, H Li… - arxiv preprint arxiv …, 2024 - arxiv.org
We present Lumina-mGPT, a family of multimodal autoregressive models capable of various
vision and language tasks, particularly excelling in generating flexible photorealistic images …

MaskBit: Embedding-free image generation via bit tokens

M Weber, L Yu, Q Yu, X Deng, X Shen… - arxiv preprint arxiv …, 2024 - arxiv.org
Masked transformer models for class-conditional image generation have become a
compelling alternative to diffusion models. Typically comprising two stages - an initial VQGAN …

EMOVA: Empowering language models to see, hear and speak with vivid emotions

K Chen, Y Gou, R Huang, Z Liu, D Tan, J Xu… - arxiv preprint arxiv …, 2024 - arxiv.org
GPT-4o, an omni-modal model that enables vocal conversations with diverse emotions and
tones, marks a milestone for omni-modal foundation models. However, empowering Large …

Visual autoregressive modeling: Scalable image generation via next-scale prediction

K Tian, Y Jiang, Z Yuan, B Peng, L Wang - arxiv preprint arxiv:2404.02905, 2024 - arxiv.org
We present Visual AutoRegressive modeling (VAR), a new generation paradigm that
redefines the autoregressive learning on images as coarse-to-fine "next-scale prediction" or "…

MaskGCT: Zero-shot text-to-speech with masked generative codec transformer

Y Wang, H Zhan, L Liu, R Zeng, H Guo, J Zheng… - arxiv preprint arxiv …, 2024 - arxiv.org
The recent large-scale text-to-speech (TTS) systems are usually grouped as autoregressive
and non-autoregressive systems. The autoregressive systems implicitly model duration but …

AdaNAT: Exploring adaptive policy for token-based image generation

Z Ni, Y Wang, R Zhou, R Lu, J Guo, J Hu, Z Liu… - … on Computer Vision, 2024 - Springer
Recent studies have demonstrated the effectiveness of token-based methods for visual
content generation. As a representative work, non-autoregressive Transformers (NATs) are …

WavChat: A survey of spoken dialogue models

S Ji, Y Chen, M Fang, J Zuo, J Lu, H Wang… - arxiv preprint arxiv …, 2024 - arxiv.org
Recent advancements in spoken dialogue models, exemplified by systems like GPT-4o,
have captured significant attention in the speech domain. Compared to traditional three-tier …

QueST: Self-supervised skill abstractions for learning continuous control

A Mete, H Xue, A Wilcox, Y Chen, A Garg - arxiv preprint arxiv …, 2024 - arxiv.org
Generalization capabilities, or rather a lack thereof, is one of the most important unsolved
problems in the field of robot learning, and while several large scale efforts have set out to …

Vector Quantization for Recommender Systems: A Review and Outlook

Q Liu, X Dong, J Xiao, N Chen, H Hu, J Zhu… - arxiv preprint arxiv …, 2024 - arxiv.org
Vector quantization, renowned for its unparalleled feature compression capabilities, has
been a prominent topic in signal processing and machine learning research for several …