What you see is what you read? Improving text-image alignment evaluation

M Yarom, Y Bitton, S Changpinyo… - Advances in …, 2023 - proceedings.neurips.cc
Automatically determining whether a text and a corresponding image are semantically
aligned is a significant challenge for vision-language models, with applications in generative …

Linguistic binding in diffusion models: Enhancing attribute correspondence through attention map alignment

R Rassin, E Hirsch, D Glickman… - Advances in …, 2023 - proceedings.neurips.cc
Text-conditioned image generation models often generate incorrect associations between
entities and their visual attributes. This reflects an impaired mapping between linguistic …

Teaching CLIP to count to ten

R Paiss, A Ephrat, O Tov, S Zada… - Proceedings of the …, 2023 - openaccess.thecvf.com
Large vision-language models, such as CLIP, learn robust representations of text and
images, facilitating advances in many downstream tasks, including zero-shot classification …

Compositional abilities emerge multiplicatively: Exploring diffusion models on a synthetic task

M Okawa, ES Lubana, R Dick… - Advances in Neural …, 2023 - proceedings.neurips.cc
Modern generative models exhibit unprecedented capabilities to generate extremely
realistic data. However, given the inherent compositionality of the real world, reliable use of …

Emergence of hidden capabilities: Exploring learning dynamics in concept space

CF Park, M Okawa, A Lee… - Advances in Neural …, 2025 - proceedings.neurips.cc
Modern generative models demonstrate impressive capabilities, likely stemming from an
ability to identify and manipulate abstract concepts underlying their training data. However …

DALL·E 2 fails to reliably capture common syntactic processes

E Leivada, E Murphy, G Marcus - Social Sciences & Humanities Open, 2023 - Elsevier
Machine intelligence is increasingly being linked to claims about sentience,
language processing, and an ability to comprehend and transform natural language into a …

SemEval-2023 task 1: Visual word sense disambiguation

A Raganato, I Calixto, A Ushio… - … 2023-Proceedings of …, 2023 - boa.unimib.it
This paper presents the Visual Word Sense Disambiguation (Visual-WSD) task. The
objective of Visual-WSD is to identify among a set of ten images the one that corresponds to …

Auditing gender presentation differences in text-to-image models

Y Zhang, L Jiang, G Turk, D Yang - … of the 4th ACM Conference on Equity …, 2024 - dl.acm.org
Text-to-image models, which can generate high-quality images based on textual input, have
recently enabled various content-creation tools. Despite significantly affecting a wide range …

Global-local image perceptual score (GLIPS): Evaluating photorealistic quality of AI-generated images

M Aziz, U Rehman, MU Danish… - IEEE Transactions on …, 2025 - ieeexplore.ieee.org
This article introduces the global-local image perceptual score (GLIPS), an image metric
designed to assess the photorealistic image quality of AI-generated images with a high …

Object-conditioned energy-based attention map alignment in text-to-image diffusion models

Y Zhang, P Yu, YN Wu - European Conference on Computer Vision, 2024 - Springer
Text-to-image diffusion models have shown great success in generating high-quality text-
guided images. Yet, these models may still fail to semantically align generated images with …