Multimodal research in vision and language: A review of current and emerging trends
Deep Learning and its applications have cascaded impactful research and development
with a diverse range of modalities present in the real-world data. More recently, this has …
Just ask: Learning to answer questions from millions of narrated videos
Recent methods for visual question answering rely on large-scale annotated datasets.
Manual annotation of questions and answers for videos, however, is tedious, expensive and …
Bottom-up and top-down attention for image captioning and visual question answering
Top-down visual attention mechanisms have been used extensively in image captioning
and visual question answering (VQA) to enable deeper image understanding through fine …
From image to language: A critical analysis of visual question answering (VQA) approaches, challenges, and opportunities
The multimodal task of Visual Question Answering (VQA), encompassing elements of
Computer Vision (CV) and Natural Language Processing (NLP), aims to generate answers …
Don't just assume; look and answer: Overcoming priors for visual question answering
A number of studies have found that today's Visual Question Answering (VQA) models are
heavily driven by superficial correlations in the training data and lack sufficient image …
Pseudo-Q: Generating pseudo language queries for visual grounding
Visual grounding, i.e., localizing objects in images according to natural language queries, is
an important topic in visual language understanding. The most effective approaches for this …
A zero-shot framework for sketch based image retrieval
Sketch-based image retrieval (SBIR) is the task of retrieving images from a natural image
database that correspond to a given hand-drawn sketch. Ideally, an SBIR model should …
All you may need for VQA are image captions
Visual Question Answering (VQA) has benefited from increasingly sophisticated models, but
has not enjoyed the same level of engagement in terms of data creation. In this paper, we …
Mutant: A training paradigm for out-of-distribution generalization in visual question answering
While progress has been made on the visual question answering leaderboards, models
often utilize spurious correlations and priors in datasets under the i.i.d. setting. As such …
Learning what makes a difference from counterfactual examples and gradient supervision
One of the primary challenges limiting the applicability of deep learning is its susceptibility to
learning spurious correlations rather than the underlying mechanisms of the task of interest …