The multi-modal fusion in visual question answering: a review of attention mechanisms
Visual Question Answering (VQA) is a significant cross-disciplinary issue in the
fields of computer vision and natural language processing that requires a computer to output …
Vision-language pre-training: Basics, recent advances, and future trends
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …
Scaling language-image pre-training via masking
We present Fast Language-Image Pre-training (FLIP), a simple and more efficient
method for training CLIP. Our method randomly masks out and removes a large portion of …
Bottom-up and top-down attention for image captioning and visual question answering
Top-down visual attention mechanisms have been used extensively in image captioning
and visual question answering (VQA) to enable deeper image understanding through fine …
GQA: A new dataset for real-world visual reasoning and compositional question answering
We introduce GQA, a new dataset for real-world visual reasoning and compositional
question answering, seeking to address key shortcomings of previous VQA datasets. We …
Making the V in VQA matter: Elevating the role of image understanding in visual question answering
Problems at the intersection of vision and language are of significant importance both as
challenging research questions and for the rich set of applications they enable. However …
Deep modular co-attention networks for visual question answering
Visual Question Answering (VQA) requires a fine-grained and simultaneous
understanding of both the visual content of images and the textual content of questions …
Bilinear attention networks
Attention networks in multimodal learning provide an efficient way to utilize given visual
information selectively. However, the computational cost to learn attention distributions for …
Explainable deep learning: A field guide for the uninitiated
Deep neural networks (DNNs) are an indispensable machine learning tool despite the
difficulty of diagnosing what aspects of a model's input drive its decisions. In countless real …
Knowledge base graph embedding module design for Visual question answering model
In this paper, a knowledge base graph embedding module is constructed to extend the
versatility of knowledge-based Visual Question Answering (VQA) models. The knowledge …