Evaluating text-to-visual generation with image-to-text generation

Z Lin, D Pathak, B Li, J Li, X Xia, G Neubig… - … on Computer Vision, 2024 - Springer
Despite significant progress in generative AI, comprehensive evaluation remains
challenging because of the lack of effective metrics and standardized benchmarks. For …

Visual programming for step-by-step text-to-image generation and evaluation

J Cho, A Zala, M Bansal - Advances in Neural Information …, 2023 - proceedings.neurips.cc
As large language models have demonstrated impressive performance in many domains,
recent works have adopted language models (LMs) as controllers of visual modules for …

Davidsonian scene graph: Improving reliability in fine-grained evaluation for text-to-image generation

J Cho, Y Hu, R Garg, P Anderson, R Krishna… - arXiv preprint arXiv…, 2023 - arxiv.org
Evaluating text-to-image models is notoriously difficult. A strong recent approach for
assessing text-image faithfulness is based on QG/A (question generation and answering) …

DOCCI: Descriptions of connected and contrasting images

Y Onoe, S Rane, Z Berger, Y Bitton, J Cho… - … on Computer Vision, 2024 - Springer
Vision-language datasets are vital for both text-to-image (T2I) and image-to-text (I2T)
research. However, current datasets lack descriptions with fine-grained detail that would …

VideoPrism: A foundational visual encoder for video understanding

L Zhao, NB Gundavarapu, L Yuan, H Zhou… - arXiv preprint arXiv…, 2024 - arxiv.org
We introduce VideoPrism, a general-purpose video encoder that tackles diverse video
understanding tasks with a single frozen model. We pretrain VideoPrism on a …

Evaluating and improving compositional text-to-visual generation

B Li, Z Lin, D Pathak, J Li, Y Fei, K Wu… - Proceedings of the …, 2024 - openaccess.thecvf.com
While text-to-visual models now produce photo-realistic images and videos, they struggle
with compositional text prompts involving attributes, relationships, and higher-order …

Contrastive region guidance: Improving grounding in vision-language models without training

D Wan, J Cho, E Stengel-Eskin, M Bansal - European Conference on …, 2024 - Springer
Highlighting particularly relevant regions of an image can improve the performance of vision-
language models (VLMs) on various vision-language (VL) tasks by guiding the model to …

A survey on advancements in image-text multimodal models: From general techniques to biomedical implementations

R Guo, J Wei, L Sun, B Yu, G Chang, D Liu… - Computers in biology …, 2024 - Elsevier
With the significant advancements of Large Language Models (LLMs) in the field of Natural
Language Processing (NLP), the development of image-text multimodal models has …

DreamMatcher: Appearance matching self-attention for semantically-consistent text-to-image personalization

J Nam, H Kim, DJ Lee, S Jin, S Kim… - Proceedings of the …, 2024 - openaccess.thecvf.com
The objective of text-to-image (T2I) personalization is to customize a diffusion model to a
user-provided reference concept, generating diverse images of the concept aligned with the …

FineMatch: Aspect-Based Fine-Grained Image and Text Mismatch Detection and Correction

H Hua, J Shi, K Kafle, S Jenni, D Zhang… - … on Computer Vision, 2024 - Springer
Recent progress in large-scale pre-training has led to the development of advanced vision-
language models (VLMs) with remarkable proficiency in comprehending and generating …