Evaluating text-to-visual generation with image-to-text generation

Z Lin, D Pathak, B Li, J Li, X Xia, G Neubig… - … on Computer Vision, 2024 - Springer
Despite significant progress in generative AI, comprehensive evaluation remains
challenging because of the lack of effective metrics and standardized benchmarks. For …

Evaluating and improving compositional text-to-visual generation

B Li, Z Lin, D Pathak, J Li, Y Fei, K Wu… - Proceedings of the …, 2024 - openaccess.thecvf.com
While text-to-visual models now produce photo-realistic images and videos, they struggle
with compositional text prompts involving attributes, relationships, and higher-order …

A survey on evaluation of multimodal large language models

J Huang, J Zhang - arXiv preprint arXiv:2408.15769, 2024 - arxiv.org
Multimodal Large Language Models (MLLMs) mimic the human perception and reasoning
system by integrating powerful Large Language Models (LLMs) with various modality …

Synthesize, diagnose, and optimize: Towards fine-grained vision-language understanding

W Peng, S Xie, Z You, S Lan… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Vision-language models (VLMs) have demonstrated remarkable performance across various
downstream tasks. However, understanding fine-grained visual-linguistic concepts such as …

Robust noisy correspondence learning with equivariant similarity consistency

Y Yang, L Wang, E Yang… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
The surge in multi-modal data has propelled cross-modal matching to the forefront of
research interest. However, the challenge lies in the laborious and expensive process of …

Auto-encoding morph-tokens for multimodal LLM

K Pan, S Tang, J Li, Z Fan, W Chow, S Yan… - arXiv preprint arXiv …, 2024 - arxiv.org
For multimodal LLMs, the synergy of visual comprehension (textual output) and generation
(visual output) presents an ongoing challenge. This is due to a conflicting objective: for …

TripletCLIP: Improving compositional reasoning of CLIP via synthetic vision-language negatives

M Patel, NSA Kusumba, S Cheng… - Advances in …, 2025 - proceedings.neurips.cc
Contrastive Language-Image Pretraining (CLIP) models maximize the mutual
information between text and visual modalities to learn representations. This makes the …

Revisiting the role of language priors in vision-language models

Z Lin, X Chen, D Pathak, P Zhang… - arXiv preprint arXiv …, 2023 - arxiv.org
Vision-language models (VLMs) are impactful in part because they can be applied to a
variety of visual understanding tasks in a zero-shot fashion, without any fine-tuning. We …

VisMin: Visual minimal-change understanding

R Awal, S Ahmadi, L Zhang… - Advances in Neural …, 2025 - proceedings.neurips.cc
Fine-grained understanding of objects, attributes, and relationships between objects is
crucial for vision-language models (VLMs). To evaluate VLMs' fine-grained understanding …

RankCLIP: Ranking-consistent language-image pretraining

Y Zhang, Z Zhao, Z Chen, Z Feng, Z Ding… - arXiv preprint arXiv …, 2024 - arxiv.org
Self-supervised contrastive learning models, such as CLIP, have set new benchmarks for
vision-language models in many downstream tasks. However, their dependency on rigid …