Foundations & trends in multimodal machine learning: Principles, challenges, and open questions
Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design
computer agents with intelligent capabilities such as understanding, reasoning, and learning …
computer agents with intelligent capabilities such as understanding, reasoning, and learning …
Foundations and Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions
Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design
computer agents with intelligent capabilities such as understanding, reasoning, and learning …
computer agents with intelligent capabilities such as understanding, reasoning, and learning …
Align your latents: High-resolution video synthesis with latent diffusion models
Abstract Latent Diffusion Models (LDMs) enable high-quality image synthesis while avoiding
excessive compute demands by training a diffusion model in a compressed lower …
excessive compute demands by training a diffusion model in a compressed lower …
Generating diverse and natural 3d human motions from text
Automated generation of 3D human motions from text is a challenging problem. The
generated motions are expected to be sufficiently diverse to explore the text-grounded …
generated motions are expected to be sufficiently diverse to explore the text-grounded …
Videocomposer: Compositional video synthesis with motion controllability
The pursuit of controllability as a higher standard of visual content creation has yielded
remarkable progress in customizable image synthesis. However, achieving controllable …
remarkable progress in customizable image synthesis. However, achieving controllable …
Phenaki: Variable length video generation from open domain textual descriptions
We present Phenaki, a model capable of realistic video synthesis given a sequence of
textual prompts. Generating videos from text is particularly challenging due to the …
textual prompts. Generating videos from text is particularly challenging due to the …
Stable video diffusion: Scaling latent video diffusion models to large datasets
We present Stable Video Diffusion-a latent video diffusion model for high-resolution, state-of-
the-art text-to-video and image-to-video generation. Recently, latent diffusion models trained …
the-art text-to-video and image-to-video generation. Recently, latent diffusion models trained …
Mastering atari with discrete world models
Intelligent agents need to generalize from past experience to achieve goals in complex
environments. World models facilitate such generalization and allow learning behaviors …
environments. World models facilitate such generalization and allow learning behaviors …
DriveDreamer: Towards Real-World-Drive World Models for Autonomous Driving
World models, especially in autonomous driving, are trending and drawing extensive
attention due to their capacity for comprehending driving environments. The established …
attention due to their capacity for comprehending driving environments. The established …
Videobert: A joint model for video and language representation learning
Self-supervised learning has become increasingly important to leverage the abundance of
unlabeled data available on platforms like YouTube. Whereas most existing approaches …
unlabeled data available on platforms like YouTube. Whereas most existing approaches …