Reinforcement learning for fine-tuning text-to-image diffusion models

Y Fan, O Watkins, Y Du, H Liu, M Ryu… - Advances in …, 2024 - proceedings.neurips.cc
Learning from human feedback has been shown to improve text-to-image models. These
techniques first learn a reward function that captures what humans care about in the task …

What you see is what you read? Improving text-image alignment evaluation

M Yarom, Y Bitton, S Changpinyo… - Advances in …, 2024 - proceedings.neurips.cc
Automatically determining whether a text and a corresponding image are semantically
aligned is a significant challenge for vision-language models, with applications in generative …

Discriminative probing and tuning for text-to-image generation

L Qu, W Wang, Y Li, H Zhang, L Nie… - Proceedings of the …, 2024 - openaccess.thecvf.com
Despite advancements in text-to-image generation (T2I), prior methods often face text-image
misalignment problems such as relation confusion in generated images. Existing solutions …

Revision: Rendering tools enable spatial fidelity in vision-language models

A Chatterjee, Y Luo, T Gokhale, Y Yang… - European Conference on …, 2024 - Springer
Text-to-Image (T2I) and multimodal large language models (MLLMs) have been
adopted in solutions for several computer vision and multimodal learning tasks. However, it …

Compositional abilities emerge multiplicatively: Exploring diffusion models on a synthetic task

M Okawa, ES Lubana, R Dick… - Advances in Neural …, 2024 - proceedings.neurips.cc
Modern generative models exhibit unprecedented capabilities to generate extremely
realistic data. However, given the inherent compositionality of the real world, reliable use of …

Controllable text-to-image generation with GPT-4

T Zhang, Y Zhang, V Vineet, N Joshi… - arXiv preprint arXiv …, 2023 - arxiv.org
Current text-to-image generation models often struggle to follow textual instructions,
especially the ones requiring spatial reasoning. On the other hand, Large Language Models …

Unsupervised compositional concepts discovery with text-to-image generative models

N Liu, Y Du, S Li, JB Tenenbaum… - Proceedings of the …, 2023 - openaccess.thecvf.com
Text-to-image generative models have enabled high-resolution image synthesis across
different domains, but require users to specify the content they wish to generate. In this …

Getting it Right: Improving Spatial Consistency in Text-to-Image Models

A Chatterjee, GBM Stan, E Aflalo, S Paul… - … on Computer Vision, 2024 - Springer
One of the key shortcomings in current text-to-image (T2I) models is their inability to
consistently generate images which faithfully follow the spatial relationships specified in the …

What's "up" with vision-language models? Investigating their struggle with spatial reasoning

A Kamath, J Hessel, KW Chang - arXiv preprint arXiv:2310.19785, 2023 - arxiv.org
Recent vision-language (VL) models are powerful, but can they reliably distinguish "right"
from "left"? We curate three new corpora to quantify model comprehension of such basic …