Attention mechanisms in computer vision: A survey

MH Guo, TX Xu, JJ Liu, ZN Liu, PT Jiang, TJ Mu… - Computational Visual …, 2022 - Springer
Humans can naturally and effectively find salient regions in complex scenes. Motivated by
this observation, attention mechanisms were introduced into computer vision with the aim of …
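
For orientation, a minimal NumPy sketch of the scaled dot-product attention that surveys like this one build on; the function name, shapes, and toy data are illustrative, not taken from the paper:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal attention: weight the values V by query-key similarity."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (n_q, n_k) similarity logits
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the keys
    return weights @ V                                # (n_q, d_v) attended output

# Toy usage: 4 queries attend over 6 key/value pairs of width 8.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
print(scaled_dot_product_attention(Q, K, V).shape)    # (4, 8)
```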

Multimodal learning with graphs

Y Ektefaie, G Dasoulas, A Noori, M Farhat… - Nature Machine …, 2023 - nature.com
Artificial intelligence for graphs has achieved remarkable success in modelling complex
systems, ranging from dynamic networks in biology to interacting particle systems in physics …

DriveLM: Driving with graph visual question answering

C Sima, K Renz, K Chitta, L Chen, H Zhang… - … on Computer Vision, 2024 - Springer
We study how vision-language models (VLMs) trained on web-scale data can be integrated
into end-to-end driving systems to boost generalization and enable interactivity with human …

Brain tumor segmentation based on the fusion of deep semantics and edge information in multimodal MRI

Z Zhu, X He, G Qi, Y Li, B Cong, Y Liu - Information Fusion, 2023 - Elsevier
Brain tumor segmentation in multimodal MRI is of great significance for clinical diagnosis and
treatment. The utilization of multimodal information plays a crucial role in brain tumor …

Graph neural networks: foundation, frontiers and applications

L Wu, P Cui, J Pei, L Zhao, X Guo - … of the 28th ACM SIGKDD Conference …, 2022 - dl.acm.org
The field of graph neural networks (GNNs) has made rapid and remarkable strides in
recent years. Graph neural networks, also known as deep learning on graphs, graph …
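
As a rough illustration of the message-passing pattern such surveys cover, a minimal sketch of one GNN layer; the mean-aggregation rule, names, and toy graph are assumptions for illustration, not code from the tutorial:

```python
import numpy as np

def gnn_layer(H, A, W):
    """One message-passing step: average neighbor features, then transform.

    H: (n, d) node features; A: (n, n) adjacency (0/1); W: (d, d_out) weights.
    """
    A_hat = A + np.eye(A.shape[0])             # add self-loops
    deg = A_hat.sum(axis=1, keepdims=True)     # node degrees incl. self
    H_agg = (A_hat @ H) / deg                  # mean over neighbors + self
    return np.maximum(H_agg @ W, 0.0)          # linear map + ReLU

# Toy graph: 3 nodes in a path, 4-dim features, 2-dim output.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
H = np.random.default_rng(1).normal(size=(3, 4))
W = np.random.default_rng(2).normal(size=(4, 2))
print(gnn_layer(H, A, W).shape)                # (3, 2)
```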

VATT: Transformers for multimodal self-supervised learning from raw video, audio and text

H Akbari, L Yuan, R Qian… - Advances in …, 2021 - proceedings.neurips.cc
We present a framework for learning multimodal representations from unlabeled data using
convolution-free Transformer architectures. Specifically, our Video-Audio-Text Transformer …

A survey on vision transformer

K Han, Y Wang, H Chen, X Chen, J Guo… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
Transformer, first applied to the field of natural language processing, is a type of deep neural
network mainly based on the self-attention mechanism. Thanks to its strong representation …
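
To make the "image as a token sequence" idea behind vision transformers concrete, a minimal sketch of ViT-style patch embedding; the patch size and dimensions are illustrative, and the surveyed models add positional embeddings and stacked self-attention on top:

```python
import numpy as np

def patchify(image, patch=16):
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0, "image must tile evenly"
    n_h, n_w = H // patch, W // patch
    x = image.reshape(n_h, patch, n_w, patch, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(n_h * n_w, patch * patch * C)   # (num_patches, patch_dim)

img = np.zeros((224, 224, 3))
tokens = patchify(img)                               # 196 tokens of dimension 768
E = np.random.default_rng(0).normal(size=(768, 192))
embedded = tokens @ E                                # linear patch embedding -> (196, 192)
print(tokens.shape, embedded.shape)
```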

Pre-trained image processing transformer

H Chen, Y Wang, T Guo, C Xu… - Proceedings of the …, 2021 - openaccess.thecvf.com
As the computing power of modern hardware increases rapidly, pre-trained deep
learning models (e.g., BERT, GPT-3) learned on large-scale datasets have shown their …

Fast Fourier convolution

L Chi, B Jiang, Y Mu - Advances in Neural Information …, 2020 - proceedings.neurips.cc
Vanilla convolutions in modern deep networks are known to operate locally and at a fixed
scale (e.g., the widely adopted 3×3 kernels in image-oriented tasks). This causes low efficacy …
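
A minimal sketch of the core idea this paper exploits: pointwise multiplication in the Fourier domain is a circular convolution in the spatial domain, so every output pixel gets an image-wide receptive field. The filter construction below is illustrative, not the paper's actual FFC architecture (which splits channels into local and global branches and uses real FFTs):

```python
import numpy as np

def spectral_conv(x, w_spectral):
    """Global circular convolution via the 2-D FFT.

    x: (H, W) feature map; w_spectral: (H, W) complex filter in frequency space.
    """
    X = np.fft.fft2(x)                   # to the frequency domain
    y = np.fft.ifft2(X * w_spectral)     # pointwise product = global convolution
    return y.real

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 32))
w = np.fft.fft2(rng.normal(size=(32, 32)) * 0.01)   # a learnable filter's spectrum
print(spectral_conv(x, w).shape)         # (32, 32): each output sees every input location
```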
