Muse: Text-to-image generation via masked generative transformers

H Chang, H Zhang, J Barber, AJ Maschinot… - arXiv preprint arXiv …, 2023 - arxiv.org
We present Muse, a text-to-image Transformer model that achieves state-of-the-art image
generation performance while being significantly more efficient than diffusion or …

4M: Massively multimodal masked modeling

D Mizrahi, R Bachmann, O Kar, T Yeo… - Advances in …, 2023 - proceedings.neurips.cc
Current machine learning models for vision are often highly specialized and limited to a
single modality and task. In contrast, recent large language models exhibit a wide range of …

Learning vision from models rivals learning vision from data

Y Tian, L Fan, K Chen, D Katabi… - Proceedings of the …, 2024 - openaccess.thecvf.com
We introduce SynCLR, a novel approach for learning visual representations exclusively from
synthetic images without any real data. We synthesize a large dataset of image captions …

GIVT: Generative infinite-vocabulary transformers

M Tschannen, C Eastwood, F Mentzer - European Conference on …, 2024 - Springer
Abstract We introduce Generative Infinite-Vocabulary Transformers (GIVT) which generate
vector sequences with real-valued entries, instead of discrete tokens from a finite …

Is Sora a world simulator? A comprehensive survey on general world models and beyond

Z Zhu, X Wang, W Zhao, C Min, N Deng, M Dou… - arXiv preprint arXiv …, 2024 - arxiv.org
General world models represent a crucial pathway toward achieving Artificial General
Intelligence (AGI), serving as the cornerstone for various applications ranging from virtual …

Revisiting non-autoregressive transformers for efficient image synthesis

Z Ni, Y Wang, R Zhou, J Guo, J Hu… - Proceedings of the …, 2024 - openaccess.thecvf.com
The field of image synthesis is currently flourishing due to the advancements in diffusion
models. While diffusion models have been successful, their computational intensity has …

Representation alignment for generation: Training diffusion transformers is easier than you think

S Yu, S Kwak, H Jang, J Jeong, J Huang, J Shin… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent studies have shown that the denoising process in (generative) diffusion models can
induce meaningful (discriminative) representations inside the model, though the quality of …

MoMask: Generative masked modeling of 3d human motions

C Guo, Y Mu, MG Javed, S Wang… - Proceedings of the …, 2024 - openaccess.thecvf.com
We introduce MoMask, a novel masked modeling framework for text-driven 3D human
motion generation. In MoMask, a hierarchical quantization scheme is employed to represent …

Masked modeling for self-supervised representation learning on vision and beyond

S Li, L Zhang, Z Wang, D Wu, L Wu, Z Liu, J Xia… - arXiv preprint arXiv …, 2023 - arxiv.org
As the deep learning revolution marches on, self-supervised learning has garnered
increasing attention in recent years thanks to its remarkable representation learning ability …

VPP: Efficient conditional 3d generation via voxel-point progressive representation

Z Qi, M Yu, R Dong, K Ma - Advances in Neural Information …, 2024 - proceedings.neurips.cc
Conditional 3D generation is undergoing a significant advancement, enabling the free
creation of 3D content from inputs such as text or 2D images. However, previous …