Sugarcrepe: Fixing hackable benchmarks for vision-language compositionality
In the last year alone, a surge of new benchmarks to measure $\textit {compositional} $
understanding of vision-language models have permeated the machine learning ecosystem …
understanding of vision-language models have permeated the machine learning ecosystem …
Tifa: Accurate and interpretable text-to-image faithfulness evaluation with question answering
Despite thousands of researchers, engineers, and artists actively working on improving text-
to-image generation models, systems often fail to produce images that accurately align with …
to-image generation models, systems often fail to produce images that accurately align with …
Evaluating text-to-visual generation with image-to-text generation
Despite significant progress in generative AI, comprehensive evaluation remains
challenging because of the lack of effective metrics and standardized benchmarks. For …
challenging because of the lack of effective metrics and standardized benchmarks. For …
Compositional chain-of-thought prompting for large multimodal models
The combination of strong visual backbones and Large Language Model (LLM) reasoning
has led to Large Multimodal Models (LMMs) becoming the current standard for a wide range …
has led to Large Multimodal Models (LMMs) becoming the current standard for a wide range …
Contrastive region guidance: Improving grounding in vision-language models without training
Highlighting particularly relevant regions of an image can improve the performance of vision-
language models (VLMs) on various vision-language (VL) tasks by guiding the model to …
language models (VLMs) on various vision-language (VL) tasks by guiding the model to …
Adaptive testing of computer vision models
Vision models often fail systematically on groups of data that share common semantic
characteristics (eg, rare objects or unusual scenes), but identifying these failure modes is a …
characteristics (eg, rare objects or unusual scenes), but identifying these failure modes is a …
FineMatch: Aspect-Based Fine-Grained Image and Text Mismatch Detection and Correction
Recent progress in large-scale pre-training has led to the development of advanced vision-
language models (VLMs) with remarkable proficiency in comprehending and generating …
language models (VLMs) with remarkable proficiency in comprehending and generating …
Videocon: Robust video-language alignment via contrast captions
Despite being (pre) trained on a massive amount of data state-of-the-art video-language
alignment models are not robust to semantically-plausible contrastive changes in the video …
alignment models are not robust to semantically-plausible contrastive changes in the video …
What's" up" with vision-language models? Investigating their struggle with spatial reasoning
Recent vision-language (VL) models are powerful, but can they reliably distinguish" right"
from" left"? We curate three new corpora to quantify model comprehension of such basic …
from" left"? We curate three new corpora to quantify model comprehension of such basic …
Open3dsg: Open-vocabulary 3d scene graphs from point clouds with queryable objects and open-set relationships
Current approaches for 3D scene graph prediction rely on labeled datasets to train models
for a fixed set of known object classes and relationship categories. We present Open3DSG …
for a fixed set of known object classes and relationship categories. We present Open3DSG …