Measuring progress in fine-grained vision-and-language understanding
While pretraining on large-scale image-text data from the Web has facilitated rapid progress
on many vision-and-language (V&L) tasks, recent work has demonstrated that pretrained …
Grounded Image Text Matching with Mismatched Relation Reasoning
Abstract This paper introduces Grounded Image Text Matching with Mismatched Relation
(GITM-MR), a novel visual-linguistic joint task that evaluates the relation understanding …
Investigating Compositional Challenges in Vision-Language Models for Visual Grounding
Pre-trained vision-language models (VLMs) have achieved high performance on various
downstream tasks which have been widely used for visual grounding tasks in a weakly …
Towards an Exhaustive Evaluation of Vision-Language Foundation Models
Vision-language foundation models have had considerable increase in performances in the
last few years. However, there is still a lack of comprehensive evaluation methods able to …
Weakly-supervised learning of visual relations in multimodal pretraining
Recent work in vision-and-language pretraining has investigated supervised signals from
object detection data to learn better, fine-grained multimodal representations. In this work …
Analyzing the Robustness of Vision & Language Models
We present an approach to evaluate the robustness of pre-trained vision and language
(V&L) models to noise in input data. Given a source image/text, we perturb it using standard …
Semantic composition in visually grounded language models
R Pandey - arXiv preprint arXiv:2305.16328, 2023 - arxiv.org
What is sentence meaning and its ideal representation? Much of the expressive power of
human language derives from semantic composition, the mind's ability to represent meaning …
MASS: Overcoming Language Bias in Image-Text Matching
Pretrained visual-language models have made significant advancements in multimodal
tasks, including image-text retrieval. However, a major challenge in image-text matching lies …
Extract Free Dense Misalignment from CLIP
Recent vision-language foundation models still frequently produce outputs misaligned with
their inputs, evidenced by object hallucination in captioning and prompt misalignment in the …
From Pixels to Explanations: Uncovering the Reasoning Process in Visual Question Answering
Visual reasoning requires models to construct a reasoning process towards the final
decision. Previous studies have used attention maps or textual explanations to illustrate the …