Foundations & trends in multimodal machine learning: Principles, challenges, and open questions
Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design
computer agents with intelligent capabilities such as understanding, reasoning, and learning …
computer agents with intelligent capabilities such as understanding, reasoning, and learning …
A survey on video diffusion models
The recent wave of AI-generated content (AIGC) has witnessed substantial success in
computer vision, with the diffusion model playing a crucial role in this achievement. Due to …
computer vision, with the diffusion model playing a crucial role in this achievement. Due to …
Align your latents: High-resolution video synthesis with latent diffusion models
Abstract Latent Diffusion Models (LDMs) enable high-quality image synthesis while avoiding
excessive compute demands by training a diffusion model in a compressed lower …
excessive compute demands by training a diffusion model in a compressed lower …
Scaling up gans for text-to-image synthesis
The recent success of text-to-image synthesis has taken the world by storm and captured the
general public's imagination. From a technical standpoint, it also marked a drastic change in …
general public's imagination. From a technical standpoint, it also marked a drastic change in …
Adversarial diffusion distillation
A Sauer, D Lorenz, A Blattmann… - European Conference on …, 2024 - Springer
Abstract We introduce Adversarial Diffusion Distillation (ADD), a novel training approach that
efficiently samples large-scale foundational image diffusion models in just 1–4 steps while …
efficiently samples large-scale foundational image diffusion models in just 1–4 steps while …
Open-vocabulary panoptic segmentation with text-to-image diffusion models
We present ODISE: Open-vocabulary DIffusion-based panoptic SEgmentation, which unifies
pre-trained text-image diffusion and discriminative models to perform open-vocabulary …
pre-trained text-image diffusion and discriminative models to perform open-vocabulary …
Lavie: High-quality video generation with cascaded latent diffusion models
This work aims to learn a high-quality text-to-video (T2V) generative model by leveraging a
pre-trained text-to-image (T2I) model as a basis. It is a highly desirable yet challenging task …
pre-trained text-to-image (T2I) model as a basis. It is a highly desirable yet challenging task …
Instructpix2pix: Learning to follow image editing instructions
We propose a method for editing images from human instructions: given an input image and
a written instruction that tells the model what to do, our model follows these instructions to …
a written instruction that tells the model what to do, our model follows these instructions to …
Zero-shot image-to-image translation
Large-scale text-to-image generative models have shown their remarkable ability to
synthesize diverse, high-quality images. However, directly applying these models for real …
synthesize diverse, high-quality images. However, directly applying these models for real …
Dynamicrafter: Animating open-domain images with video diffusion priors
Animating a still image offers an engaging visual experience. Traditional image animation
techniques mainly focus on animating natural scenes with stochastic dynamics (eg clouds …
techniques mainly focus on animating natural scenes with stochastic dynamics (eg clouds …