Foundations & trends in multimodal machine learning: Principles, challenges, and open questions
Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design
computer agents with intelligent capabilities such as understanding, reasoning, and learning …
Photorealistic text-to-image diffusion models with deep language understanding
We present Imagen, a text-to-image diffusion model with an unprecedented degree of
photorealism and a deep level of language understanding. Imagen builds on the power of …
Video diffusion models
Generating temporally coherent high fidelity video is an important milestone in generative
modeling research. We make progress towards this milestone by proposing a diffusion …
When and why vision-language models behave like bags-of-words, and what to do about it?
Despite the success of large vision and language models (VLMs) in many downstream
applications, it is unclear how well they encode compositional information. Here, we create …
T2I-CompBench: A comprehensive benchmark for open-world compositional text-to-image generation
Despite the stunning ability to generate high-quality images by recent text-to-image models,
current approaches often struggle to effectively compose objects with different attributes and …
Text-to-image diffusion models in generative AI: A survey
This survey reviews text-to-image diffusion models in the context that diffusion models have
emerged to be popular for a wide range of generative tasks. As a self-contained work, this …
SugarCrepe: Fixing hackable benchmarks for vision-language compositionality
In the last year alone, a surge of new benchmarks to measure compositional
understanding of vision-language models have permeated the machine learning ecosystem …
Easily accessible text-to-image generation amplifies demographic stereotypes at large scale
Machine learning models that convert user-written text descriptions into images are now
widely available online and used by millions of users to generate millions of images a day …
TIFA: Accurate and interpretable text-to-image faithfulness evaluation with question answering
Despite thousands of researchers, engineers, and artists actively working on improving text-
to-image generation models, systems often fail to produce images that accurately align with …