Multimodal research in vision and language: A review of current and emerging trends

S Uppal, S Bhagat, D Hazarika, N Majumder, S Poria… - Information …, 2022 - Elsevier
Deep learning and its applications have driven impactful research and development across the diverse range of modalities present in real-world data. More recently, this has …

Just ask: Learning to answer questions from millions of narrated videos

A Yang, A Miech, J Sivic, I Laptev… - Proceedings of the …, 2021 - openaccess.thecvf.com
Recent methods for visual question answering rely on large-scale annotated datasets.
Manual annotation of questions and answers for videos, however, is tedious, expensive and …

Bottom-up and top-down attention for image captioning and visual question answering

P Anderson, X He, C Buehler… - Proceedings of the …, 2018 - openaccess.thecvf.com
Top-down visual attention mechanisms have been used extensively in image captioning
and visual question answering (VQA) to enable deeper image understanding through fine …
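
The snippet above refers to attending over image regions under top-down (question-driven) control. The minimal PyTorch sketch below illustrates that general bottom-up/top-down pattern by re-weighting precomputed "bottom-up" region features with a question encoding; the class name, layer sizes, and feature dimensions are illustrative assumptions, not the paper's architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownAttention(nn.Module):
    """Question-guided soft attention over precomputed region features.

    A minimal sketch of the general bottom-up/top-down idea: "bottom-up"
    region features (e.g. from an object detector) are re-weighted by a
    "top-down" signal derived from the question. Sizes are illustrative,
    not the paper's configuration.
    """

    def __init__(self, region_dim=2048, question_dim=512, hidden_dim=512):
        super().__init__()
        self.proj_regions = nn.Linear(region_dim, hidden_dim)
        self.proj_question = nn.Linear(question_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, regions, question):
        # regions:  (batch, num_regions, region_dim) bottom-up features
        # question: (batch, question_dim) pooled question encoding
        joint = torch.tanh(
            self.proj_regions(regions) + self.proj_question(question).unsqueeze(1)
        )
        weights = F.softmax(self.score(joint), dim=1)   # (batch, num_regions, 1)
        return (weights * regions).sum(dim=1)           # attended image feature

# Toy usage with random tensors standing in for detector and question features.
att = TopDownAttention()
v = att(torch.randn(2, 36, 2048), torch.randn(2, 512))
print(v.shape)  # torch.Size([2, 2048])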

From image to language: A critical analysis of visual question answering (VQA) approaches, challenges, and opportunities

MF Ishmam, MSH Shovon, MF Mridha, N Dey - Information Fusion, 2024 - Elsevier
The multimodal task of Visual Question Answering (VQA), encompassing elements of
Computer Vision (CV) and Natural Language Processing (NLP), aims to generate answers …

Don't just assume; look and answer: Overcoming priors for visual question answering

A Agrawal, D Batra, D Parikh… - Proceedings of the …, 2018 - openaccess.thecvf.com
A number of studies have found that today's Visual Question Answering (VQA) models are
heavily driven by superficial correlations in the training data and lack sufficient image …

Pseudo-Q: Generating pseudo language queries for visual grounding

H Jiang, Y Lin, D Han, S Song… - Proceedings of the …, 2022 - openaccess.thecvf.com
Visual grounding, i.e., localizing objects in images according to natural language queries, is
an important topic in visual language understanding. The most effective approaches for this …
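
As a generic illustration of the grounding formulation described above (not of Pseudo-Q's pseudo-query generation pipeline), the sketch below picks the detected region whose embedding best matches a query embedding; the text and region encoders producing those embeddings are assumed and not shown.

import torch
import torch.nn.functional as F

def ground_query(query_embedding, region_embeddings, region_boxes):
    """Pick the detected region whose embedding best matches the query.

    A generic illustration of the visual-grounding formulation only. Assumes a
    text encoder and a region encoder (not shown) have already produced
    embeddings in a shared space.
    """
    query = F.normalize(query_embedding, dim=-1)      # (d,)
    regions = F.normalize(region_embeddings, dim=-1)  # (num_regions, d)
    scores = regions @ query                          # (num_regions,)
    return region_boxes[scores.argmax()]              # box of the best region

# Toy usage: 10 candidate regions with 256-d embeddings and xyxy boxes.
boxes = torch.rand(10, 4)
print(ground_query(torch.randn(256), torch.randn(10, 256), boxes))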

A zero-shot framework for sketch based image retrieval

SK Yelamarthi, SK Reddy… - Proceedings of the …, 2018 - openaccess.thecvf.com
Sketch-based image retrieval (SBIR) is the task of retrieving images from a natural image
database that correspond to a given hand-drawn sketch. Ideally, an SBIR model should …
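
The retrieval step of SBIR can be illustrated in a few lines of PyTorch: rank database images by cosine similarity to the query sketch in a shared embedding space. This is a minimal sketch of the task setup only, assuming sketch and image encoders (not shown); it is not the zero-shot framework proposed in the paper.

import torch
import torch.nn.functional as F

def retrieve(sketch_embedding, image_embeddings, top_k=5):
    """Rank database images by cosine similarity to a query sketch.

    Illustrates only the retrieval step of SBIR: it assumes sketches and
    images have already been mapped into a shared embedding space by some
    pair of encoders (not shown, and not the paper's specific model).
    """
    sketch = F.normalize(sketch_embedding.unsqueeze(0), dim=-1)  # (1, d)
    images = F.normalize(image_embeddings, dim=-1)               # (n, d)
    scores = images @ sketch.t()                                 # (n, 1)
    return scores.squeeze(1).topk(top_k).indices                 # best matches

# Toy usage: 1000 database images and one query sketch, both 256-d embeddings.
db = torch.randn(1000, 256)
query = torch.randn(256)
print(retrieve(query, db))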

All you may need for VQA are image captions

S Changpinyo, D Kukliansky, I Szpektor… - arXiv preprint arXiv …, 2022 - arxiv.org
Visual Question Answering (VQA) has benefited from increasingly sophisticated models, but
has not enjoyed the same level of engagement in terms of data creation. In this paper, we …

MUTANT: A training paradigm for out-of-distribution generalization in visual question answering

T Gokhale, P Banerjee, C Baral, Y Yang - arXiv preprint arXiv:2009.08566, 2020 - arxiv.org
While progress has been made on the visual question answering leaderboards, models
often utilize spurious correlations and priors in datasets under the i.i.d. setting. As such …

Learning what makes a difference from counterfactual examples and gradient supervision

D Teney, E Abbasnejad, A van den Hengel - Computer Vision–ECCV …, 2020 - Springer
One of the primary challenges limiting the applicability of deep learning is its susceptibility to
learning spurious correlations rather than the underlying mechanisms of the task of interest …
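
One way to read "gradient supervision over counterfactual examples" is as an auxiliary loss that aligns the input gradient of the task loss with the direction from an example to its counterfactual. The PyTorch sketch below implements that reading; the loss form and the 0.1 weight are assumptions, not necessarily the paper's exact objective.

import torch
import torch.nn.functional as F

def gradient_supervision_loss(model, x, x_cf, y):
    """Task loss plus a term encouraging the input gradient to point from an
    example toward its counterfactual.

    A rough sketch of one gradient-supervision formulation over counterfactual
    pairs (x, x_cf); the exact loss used in the paper may differ. `model` maps
    a feature vector to class logits.
    """
    x = x.clone().requires_grad_(True)
    loss_task = F.cross_entropy(model(x), y)
    # Gradient of the task loss w.r.t. the input features.
    grad = torch.autograd.grad(loss_task, x, create_graph=True)[0]
    # Direction from each example to its counterfactual.
    direction = x_cf - x
    cos = F.cosine_similarity(grad.flatten(1), direction.flatten(1), dim=1)
    loss_gs = (1.0 - cos).mean()      # penalize misaligned input gradients
    return loss_task + 0.1 * loss_gs  # 0.1 is an illustrative weight

# Toy usage: a linear classifier over 32-d features and a batch of 4 pairs.
model = torch.nn.Linear(32, 3)
x, x_cf = torch.randn(4, 32), torch.randn(4, 32)
y = torch.randint(0, 3, (4,))
print(gradient_supervision_loss(model, x, x_cf, y))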