Is Sora a world simulator? A comprehensive survey on general world models and beyond
General world models represent a crucial pathway toward achieving Artificial General
Intelligence (AGI), serving as the cornerstone for various applications ranging from virtual …
Representation alignment for generation: Training diffusion transformers is easier than you think
Recent studies have shown that the denoising process in (generative) diffusion models can
induce meaningful (discriminative) representations inside the model, though the quality of …
Self-rectifying diffusion sampling with perturbed-attention guidance
Recent studies have demonstrated that diffusion models can generate high-quality samples,
but their quality heavily depends on sampling guidance techniques, such as classifier …
Diffusion models and representation learning: A survey
Diffusion Models are popular generative modeling methods in various vision tasks, attracting
significant attention. They can be considered a unique instance of self-supervised learning …
Visual autoregressive modeling: Scalable image generation via next-scale prediction
We present Visual AutoRegressive modeling (VAR), a new generation paradigm that
redefines the autoregressive learning on images as coarse-to-fine "next-scale prediction" or "…
DisCo-Diff: Enhancing continuous diffusion models with discrete latents
Diffusion models (DMs) have revolutionized generative learning. They utilize a diffusion
process to encode data into a simple Gaussian distribution. However, encoding a complex …
Cross-conditioned diffusion model for medical image-to-image translation
Multi-modal magnetic resonance imaging (MRI) provides rich, complementary information
for analyzing diseases. However, the practical challenges of acquiring multiple MRI …
MetaMorph: Multimodal understanding and generation via instruction tuning
In this work, we propose Visual-Predictive Instruction Tuning (VPiT), a simple and effective
extension to visual instruction tuning that enables a pretrained LLM to quickly morph into an …
BiGR: Harnessing binary latent codes for image generation and improved visual representation capabilities
We introduce BiGR, a novel conditional image generation model using compact binary
latent codes for generative training, focusing on enhancing both generation and …
Contrastive learning with synthetic positives
Contrastive learning with the nearest neighbor has proved to be one of the most efficient
self-supervised learning (SSL) techniques by utilizing the similarity of multiple instances within …