SugarCrepe: Fixing hackable benchmarks for vision-language compositionality

CY Hsieh, J Zhang, Z Ma… - Advances in neural …, 2024 - proceedings.neurips.cc
In the last year alone, a surge of new benchmarks to measure compositional
understanding of vision-language models has permeated the machine learning ecosystem …

TIFA: Accurate and interpretable text-to-image faithfulness evaluation with question answering

Y Hu, B Liu, J Kasai, Y Wang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Despite thousands of researchers, engineers, and artists actively working on improving text-
to-image generation models, systems often fail to produce images that accurately align with …

Evaluating text-to-visual generation with image-to-text generation

Z Lin, D Pathak, B Li, J Li, X Xia, G Neubig… - … on Computer Vision, 2024 - Springer
Despite significant progress in generative AI, comprehensive evaluation remains
challenging because of the lack of effective metrics and standardized benchmarks. For …

Compositional chain-of-thought prompting for large multimodal models

C Mitra, B Huang, T Darrell… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
The combination of strong visual backbones and Large Language Model (LLM) reasoning
has led to Large Multimodal Models (LMMs) becoming the current standard for a wide range …

Contrastive region guidance: Improving grounding in vision-language models without training

D Wan, J Cho, E Stengel-Eskin, M Bansal - European Conference on …, 2024 - Springer
Highlighting particularly relevant regions of an image can improve the performance of vision-
language models (VLMs) on various vision-language (VL) tasks by guiding the model to …

Adaptive testing of computer vision models

I Gao, G Ilharco, S Lundberg… - Proceedings of the …, 2023 - openaccess.thecvf.com
Vision models often fail systematically on groups of data that share common semantic
characteristics (e.g., rare objects or unusual scenes), but identifying these failure modes is a …

FineMatch: Aspect-Based Fine-Grained Image and Text Mismatch Detection and Correction

H Hua, J Shi, K Kafle, S Jenni, D Zhang… - … on Computer Vision, 2024 - Springer
Recent progress in large-scale pre-training has led to the development of advanced vision-
language models (VLMs) with remarkable proficiency in comprehending and generating …

VideoCon: Robust video-language alignment via contrast captions

H Bansal, Y Bitton, I Szpektor… - Proceedings of the …, 2024 - openaccess.thecvf.com
Despite being (pre-)trained on a massive amount of data, state-of-the-art video-language
alignment models are not robust to semantically-plausible contrastive changes in the video …

What's" up" with vision-language models? Investigating their struggle with spatial reasoning

A Kamath, J Hessel, KW Chang - arXiv preprint arXiv:2310.19785, 2023 - arxiv.org
Recent vision-language (VL) models are powerful, but can they reliably distinguish "right"
from "left"? We curate three new corpora to quantify model comprehension of such basic …

Open3DSG: Open-vocabulary 3D scene graphs from point clouds with queryable objects and open-set relationships

S Koch, N Vaskevicius, M Colosi… - Proceedings of the …, 2024 - openaccess.thecvf.com
Current approaches for 3D scene graph prediction rely on labeled datasets to train models
for a fixed set of known object classes and relationship categories. We present Open3DSG …