Muse: Text-to-image generation via masked generative transformers
We present Muse, a text-to-image Transformer model that achieves state-of-the-art image
generation performance while being significantly more efficient than diffusion or …
4M: Massively multimodal masked modeling
Current machine learning models for vision are often highly specialized and limited to a
single modality and task. In contrast, recent large language models exhibit a wide range of …
Learning vision from models rivals learning vision from data
We introduce SynCLR, a novel approach for learning visual representations exclusively from
synthetic images, without any real data. We synthesize a large dataset of image captions …
GIVT: Generative infinite-vocabulary transformers
We introduce Generative Infinite-Vocabulary Transformers (GIVT), which generate
vector sequences with real-valued entries, instead of discrete tokens from a finite …
Is Sora a world simulator? A comprehensive survey on general world models and beyond
General world models represent a crucial pathway toward achieving Artificial General
Intelligence (AGI), serving as the cornerstone for various applications ranging from virtual …
Revisiting non-autoregressive transformers for efficient image synthesis
The field of image synthesis is currently flourishing due to the advancements in diffusion
models. While diffusion models have been successful, their computational intensity has …
Representation alignment for generation: Training diffusion transformers is easier than you think
Recent studies have shown that the denoising process in (generative) diffusion models can
induce meaningful (discriminative) representations inside the model, though the quality of …
MoMask: Generative masked modeling of 3D human motions
We introduce MoMask, a novel masked modeling framework for text-driven 3D human
motion generation. In MoMask, a hierarchical quantization scheme is employed to represent …
Masked modeling for self-supervised representation learning on vision and beyond
As the deep learning revolution marches on, self-supervised learning has garnered
increasing attention in recent years thanks to its remarkable representation learning ability …
VPP: Efficient conditional 3D generation via voxel-point progressive representation
Conditional 3D generation is undergoing significant advancement, enabling the free
creation of 3D content from inputs such as text or 2D images. However, previous …