Vision-language pre-training: Basics, recent advances, and future trends
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …
Multimodal intelligence: Representation learning, information fusion, and applications
Deep learning methods have revolutionized speech recognition, image recognition, and
natural language processing since 2010. Each of these tasks involves a single modality in …
Multiscale feature extraction and fusion of image and text in VQA
The Visual Question Answering (VQA) system is the process of finding useful
information from images related to the question to answer the question correctly. It can be …
Deep modular co-attention networks for visual question answering
Visual Question Answering (VQA) requires a fine-grained and simultaneous
understanding of both the visual content of images and the textual content of questions …
Attention, please! A survey of neural attention models in deep learning
In humans, Attention is a core property of all perceptual and cognitive operations. Given our
limited ability to process competing sources, attention mechanisms select, modulate, and …
Image retrieval on real-life images with pre-trained vision-and-language models
We extend the task of composed image retrieval, where an input query consists of an image
and short textual description of how to modify the image. Existing methods have only been …
Bilinear attention networks
Attention networks in multimodal learning provide an efficient way to utilize given visual
information selectively. However, the computational cost to learn attention distributions for …
Residual attention network for image classification
In this work, we propose "Residual Attention Network", a convolutional neural network using
attention mechanism which can incorporate with state-of-art feed forward network …
Deep multimodal learning: A survey on recent advances and trends
The success of deep learning has been a catalyst to solving increasingly complex machine-
learning problems, which often involve multiple data modalities. We review recent advances …
FashionVLP: Vision language transformer for fashion retrieval with feedback
Fashion image retrieval based on a query pair of reference image and natural language
feedback is a challenging task that requires models to assess fashion related information …