Swin transformer: Hierarchical vision transformer using shifted windows

Z Liu, Y Lin, Y Cao, H Hu, Y Wei… - Proceedings of the …, 2021 - openaccess.thecvf.com
This paper presents a new vision Transformer, called Swin Transformer, that capably serves
as a general-purpose backbone for computer vision. Challenges in adapting Transformer …

CvT: Introducing convolutions to vision transformers

H Wu, B Xiao, N Codella, M Liu, X Dai… - Proceedings of the …, 2021 - openaccess.thecvf.com
We present in this paper a new architecture, named Convolutional vision Transformer (CvT),
that improves Vision Transformer (ViT) in performance and efficiency by introducing …

CSWin transformer: A general vision transformer backbone with cross-shaped windows

X Dong, J Bao, D Chen, W Zhang… - Proceedings of the …, 2022 - openaccess.thecvf.com
We present CSWin Transformer, an efficient and effective Transformer-based
backbone for general-purpose vision tasks. A challenging issue in Transformer design is …

Transformer in transformer

K Han, A Xiao, E Wu, J Guo, C Xu… - Advances in neural …, 2021 - proceedings.neurips.cc
Transformer is a new kind of neural architecture which encodes the input data as powerful
features via the attention mechanism. Basically, the visual transformers first divide the input …

Multiscale vision transformers

H Fan, B Xiong, K Mangalam, Y Li… - Proceedings of the …, 2021 - openaccess.thecvf.com
We present Multiscale Vision Transformers (MViT) for video and image recognition,
by connecting the seminal idea of multiscale feature hierarchies with transformer models …

LeViT: a vision transformer in convnet's clothing for faster inference

B Graham, A El-Nouby, H Touvron… - Proceedings of the …, 2021 - openaccess.thecvf.com
We design a family of image classification architectures that optimize the trade-off between
accuracy and efficiency in a high-speed regime. Our work exploits recent findings in …

UniFormer: Unifying convolution and self-attention for visual recognition

K Li, Y Wang, J Zhang, P Gao, G Song… - … on Pattern Analysis …, 2023 - ieeexplore.ieee.org
It is a challenging task to learn discriminative representation from images and videos, due to
large local redundancy and complex global dependency in these visual data. Convolution …

Rethinking and improving relative position encoding for vision transformer

K Wu, H Peng, M Chen, J Fu… - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com
Relative position encoding (RPE) is important for transformers to capture the sequence ordering of
input tokens. Its general efficacy has been proven in natural language processing. However …

P2T: Pyramid pooling transformer for scene understanding

YH Wu, Y Liu, X Zhan… - IEEE transactions on …, 2022 - ieeexplore.ieee.org
Recently, the vision transformer has achieved great success by pushing the state-of-the-art
of various vision tasks. One of the most challenging problems in the vision transformer is that …

Towards robust vision transformer

X Mao, G Qi, Y Chen, X Li, R Duan… - Proceedings of the …, 2022 - openaccess.thecvf.com
Recent advances in Vision Transformer (ViT) and its improved variants have
shown that self-attention-based networks surpass traditional Convolutional Neural Networks …