What you see is what you read? Improving text-image alignment evaluation
Automatically determining whether a text and a corresponding image are semantically
aligned is a significant challenge for vision-language models, with applications in generative …
Linguistic binding in diffusion models: Enhancing attribute correspondence through attention map alignment
Text-conditioned image generation models often generate incorrect associations between
entities and their visual attributes. This reflects an impaired mapping between linguistic …
Teaching CLIP to count to ten
Large vision-language models, such as CLIP, learn robust representations of text and
images, facilitating advances in many downstream tasks, including zero-shot classification …
Compositional abilities emerge multiplicatively: Exploring diffusion models on a synthetic task
Modern generative models exhibit unprecedented capabilities to generate extremely
realistic data. However, given the inherent compositionality of the real world, reliable use of …
Emergence of hidden capabilities: Exploring learning dynamics in concept space
Modern generative models demonstrate impressive capabilities, likely stemming from an
ability to identify and manipulate abstract concepts underlying their training data. However …
DALL·E 2 fails to reliably capture common syntactic processes
Machine intelligence is increasingly being linked to claims about sentience,
language processing, and an ability to comprehend and transform natural language into a …
SemEval-2023 task 1: Visual word sense disambiguation
This paper presents the Visual Word Sense Disambiguation (Visual-WSD) task. The
objective of Visual-WSD is to identify among a set of ten images the one that corresponds to …
Auditing gender presentation differences in text-to-image models
Text-to-image models, which can generate high-quality images based on textual input, have
recently enabled various content-creation tools. Despite significantly affecting a wide range …
Global-local image perceptual score (GLIPS): Evaluating photorealistic quality of AI-generated images
This article introduces the global-local image perceptual score (GLIPS), an image metric
designed to assess the photorealistic image quality of AI-generated images with a high …
Object-conditioned energy-based attention map alignment in text-to-image diffusion models
Text-to-image diffusion models have shown great success in generating high-quality
text-guided images. Yet, these models may still fail to semantically align generated images with …