Swin transformer: Hierarchical vision transformer using shifted windows
This paper presents a new vision Transformer, called Swin Transformer, that capably serves
as a general-purpose backbone for computer vision. Challenges in adapting Transformer …
CvT: Introducing convolutions to vision transformers
We present in this paper a new architecture, named Convolutional vision Transformer (CvT),
that improves Vision Transformer (ViT) in performance and efficiency by introducing …
CSWin transformer: A general vision transformer backbone with cross-shaped windows
Abstract We present CSWin Transformer, an efficient and effective Transformer-based
backbone for general-purpose vision tasks. A challenging issue in Transformer design is …
Transformer in transformer
The Transformer is a new kind of neural architecture that encodes the input data as powerful
features via the attention mechanism. Basically, visual transformers first divide the input …
Multiscale vision transformers
Abstract We present Multiscale Vision Transformers (MViT) for video and image recognition,
by connecting the seminal idea of multiscale feature hierarchies with transformer models …
LeViT: a vision transformer in ConvNet's clothing for faster inference
We design a family of image classification architectures that optimize the trade-off between
accuracy and efficiency in a high-speed regime. Our work exploits recent findings in …
UniFormer: Unifying convolution and self-attention for visual recognition
It is a challenging task to learn discriminative representations from images and videos, due to
the large local redundancy and complex global dependencies in these visual data. Convolution …
Rethinking and improving relative position encoding for vision transformer
Relative position encoding (RPE) is important for transformers to capture the sequence ordering
of input tokens. Its general efficacy has been proven in natural language processing. However …
P2T: Pyramid pooling transformer for scene understanding
Recently, the vision transformer has achieved great success by pushing the state of the art
in various vision tasks. One of the most challenging problems in the vision transformer is that …
Towards robust vision transformer
Abstract Recent advances on Vision Transformer (ViT) and its improved variants have
shown that self-attention-based networks surpass traditional Convolutional Neural Networks …