Measuring progress in fine-grained vision-and-language understanding

E Bugliarello, L Sartran, A Agrawal… - arXiv preprint arXiv …, 2023 - arxiv.org
While pretraining on large-scale image-text data from the Web has facilitated rapid progress
on many vision-and-language (V&L) tasks, recent work has demonstrated that pretrained …

Grounded Image Text Matching with Mismatched Relation Reasoning

Y Wu, Y Wei, H Wang, Y Liu… - Proceedings of the …, 2023 - openaccess.thecvf.com
Abstract This paper introduces Grounded Image Text Matching with Mismatched Relation
(GITM-MR), a novel visual-linguistic joint task that evaluates the relation understanding …

Investigating Compositional Challenges in Vision-Language Models for Visual Grounding

Y Zeng, Y Huang, J Zhang, Z Jie… - Proceedings of the …, 2024 - openaccess.thecvf.com
Pre-trained vision-language models (VLMs) have achieved high performance on various
downstream tasks and have been widely used for visual grounding tasks in a weakly …

Towards an Exhaustive Evaluation of Vision-Language Foundation Models

E Salin, S Ayache, B Favre - Proceedings of the IEEE/CVF …, 2023 - openaccess.thecvf.com
Vision-language foundation models have seen a considerable increase in performance in the
last few years. However, there is still a lack of comprehensive evaluation methods able to …

Weakly-supervised learning of visual relations in multimodal pretraining

E Bugliarello, A Nematzadeh, LA Hendricks - arXiv preprint arXiv …, 2023 - arxiv.org
Recent work in vision-and-language pretraining has investigated supervised signals from
object detection data to learn better, fine-grained multimodal representations. In this work …

Analyzing the Robustness of Vision & Language Models

A Shirnin, N Andreev, S Potapova… - IEEE/ACM Transactions on …, 2024 - ieeexplore.ieee.org
We present an approach to evaluate the robustness of pre-trained vision and language
(V&L) models to noise in input data. Given a source image/text, we perturb it using standard …

Semantic composition in visually grounded language models

R Pandey - arXiv preprint arXiv:2305.16328, 2023 - arxiv.org
What is sentence meaning and its ideal representation? Much of the expressive power of
human language derives from semantic composition, the mind's ability to represent meaning …

MASS: Overcoming Language Bias in Image-Text Matching

J Chung, S Lim, S Lee, Y Yu - arXiv preprint arXiv:2501.11469, 2025 - arxiv.org
Pretrained vision-language models have made significant advancements in multimodal
tasks, including image-text retrieval. However, a major challenge in image-text matching lies …

Extract Free Dense Misalignment from CLIP

JY Nam, J Im, W Kim, T Kil - arXiv preprint arXiv:2412.18404, 2024 - arxiv.org
Recent vision-language foundation models still frequently produce outputs misaligned with
their inputs, evidenced by object hallucination in captioning and prompt misalignment in the …

From Pixels to Explanations: Uncovering the Reasoning Process in Visual Question Answering

S Zhang, J Liu, Z Wei - Proceedings of the 5th ACM International …, 2023 - dl.acm.org
Visual reasoning requires models to construct a reasoning process towards the final
decision. Previous studies have used attention maps or textual explanations to illustrate the …