The multi-modal fusion in visual question answering: a review of attention mechanisms
Abstract Visual Question Answering (VQA) is a significant cross-disciplinary issue in the
fields of computer vision and natural language processing that requires a computer to output …
fields of computer vision and natural language processing that requires a computer to output …
Attention mechanisms in computer vision: A survey
Humans can naturally and effectively find salient regions in complex scenes. Motivated by
this observation, attention mechanisms were introduced into computer vision with the aim of …
this observation, attention mechanisms were introduced into computer vision with the aim of …
Dual aggregation transformer for image super-resolution
Transformer has recently gained considerable popularity in low-level vision tasks, including
image super-resolution (SR). These networks utilize self-attention along different …
image super-resolution (SR). These networks utilize self-attention along different …
Segnext: Rethinking convolutional attention design for semantic segmentation
We present SegNeXt, a simple convolutional network architecture for semantic
segmentation. Recent transformer-based models have dominated the field of se-mantic …
segmentation. Recent transformer-based models have dominated the field of se-mantic …
Efficient multi-scale attention module with cross-spatial learning
D Ouyang, S He, G Zhang, M Luo… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org
Remarkable effectiveness of the channel or spatial attention mechanisms for producing
more discernible feature representation are illustrated in various computer vision tasks …
more discernible feature representation are illustrated in various computer vision tasks …
Visual attention network
While originally designed for natural language processing tasks, the self-attention
mechanism has recently taken various computer vision areas by storm. However, the 2D …
mechanism has recently taken various computer vision areas by storm. However, the 2D …
Clipcap: Clip prefix for image captioning
Image captioning is a fundamental task in vision-language understanding, where the model
predicts a textual informative caption to a given input image. In this paper, we present a …
predicts a textual informative caption to a given input image. In this paper, we present a …
Egocentric video-language pretraining
Abstract Video-Language Pretraining (VLP), which aims to learn transferable representation
to advance a wide range of video-text downstream tasks, has recently received increasing …
to advance a wide range of video-text downstream tasks, has recently received increasing …
Dynamic neural networks: A survey
Dynamic neural network is an emerging research topic in deep learning. Compared to static
models which have fixed computational graphs and parameters at the inference stage …
models which have fixed computational graphs and parameters at the inference stage …
Contextual object detection with multimodal large language models
Abstract Recent Multimodal Large Language Models (MLLMs) are remarkable in vision-
language tasks, such as image captioning and question answering, but lack the essential …
language tasks, such as image captioning and question answering, but lack the essential …