Is Sora a world simulator? A comprehensive survey on general world models and beyond

Z Zhu, X Wang, W Zhao, C Min, N Deng, M Dou… - arxiv preprint arxiv …, 2024 - arxiv.org
General world models represent a crucial pathway toward achieving Artificial General
Intelligence (AGI), serving as the cornerstone for various applications ranging from virtual …

Representation alignment for generation: Training diffusion transformers is easier than you think

S Yu, S Kwak, H Jang, J Jeong, J Huang, J Shin… - arxiv preprint arxiv …, 2024 - arxiv.org
Recent studies have shown that the denoising process in (generative) diffusion models can
induce meaningful (discriminative) representations inside the model, though the quality of …

Self-rectifying diffusion sampling with perturbed-attention guidance

D Ahn, H Cho, J Min, W Jang, J Kim, SH Kim… - … on Computer Vision, 2024 - Springer
Recent studies have demonstrated that diffusion models can generate high-quality samples,
but their quality heavily depends on sampling guidance techniques, such as classifier …

Diffusion models and representation learning: A survey

M Fuest, P Ma, M Gui, JS Fischer, VT Hu… - arxiv preprint arxiv …, 2024 - arxiv.org
Diffusion Models are popular generative modeling methods in various vision tasks, attracting
significant attention. They can be considered a unique instance of self-supervised learning …

Visual autoregressive modeling: Scalable image generation via next-scale prediction

K Tian, Y Jiang, Z Yuan, B Peng, L Wang - arxiv preprint arxiv:2404.02905, 2024 - arxiv.org
We present Visual AutoRegressive modeling (VAR), a new generation paradigm that
redefines the autoregressive learning on images as coarse-to-fine "next-scale prediction" or "…

DisCo-Diff: Enhancing continuous diffusion models with discrete latents

Y Xu, G Corso, T Jaakkola, A Vahdat… - arxiv preprint arxiv …, 2024 - arxiv.org
Diffusion models (DMs) have revolutionized generative learning. They utilize a diffusion
process to encode data into a simple Gaussian distribution. However, encoding a complex …

Cross-conditioned diffusion model for medical image to image translation

Z Xing, S Yang, S Chen, T Ye, Y Yang, J Qin… - … Conference on Medical …, 2024 - Springer
Multi-modal magnetic resonance imaging (MRI) provides rich, complementary information
for analyzing diseases. However, the practical challenges of acquiring multiple MRI …

MetaMorph: Multimodal understanding and generation via instruction tuning

S Tong, D Fan, J Zhu, Y Xiong, X Chen, K Sinha… - arxiv preprint arxiv …, 2024 - arxiv.org
In this work, we propose Visual-Predictive Instruction Tuning (VPiT), a simple and effective
extension to visual instruction tuning that enables a pretrained LLM to quickly morph into an …

BiGR: Harnessing binary latent codes for image generation and improved visual representation capabilities

S Hao, X Liu, X Qi, S Zhao, B Zi, R Xiao, K Han… - arxiv preprint arxiv …, 2024 - arxiv.org
We introduce BiGR, a novel conditional image generation model using compact binary
latent codes for generative training, focusing on enhancing both generation and …

Contrastive learning with synthetic positives

D Zeng, Y Wu, X Hu, X Xu, Y Shi - European Conference on Computer …, 2024 - Springer
Contrastive learning with the nearest neighbor has proved to be one of the most efficient
self-supervised learning (SSL) techniques by utilizing the similarity of multiple instances within …