A comprehensive survey of transformers for computer vision
Vision transformers (ViTs), a special type of transformer, can be used for various computer
vision (CV) applications. Convolutional neural networks (CNNs) have several potential …
Flatten transformer: Vision transformer using focused linear attention
The quadratic computational complexity of self-attention has been a persistent challenge
when applying Transformer models to vision tasks. Linear attention, on the other hand, offers …
Measuring and narrowing the compositionality gap in language models
We investigate the ability of language models to perform compositional reasoning tasks
where the overall solution depends on correctly composing the answers to sub-problems …
Vision transformer with deformable attention
Transformers have recently shown superior performance on various vision tasks. The large,
sometimes even global, receptive field endows Transformer models with higher …
A survey of visual transformers
Transformer, an attention-based encoder–decoder model, has already revolutionized the
field of natural language processing (NLP). Inspired by such significant achievements, some …
Dynamic neural networks: A survey
Dynamic neural networks are an emerging research topic in deep learning. Compared to static
models, which have fixed computational graphs and parameters at the inference stage …
Adaptive rotated convolution for rotated object detection
Rotated object detection aims to identify and locate objects in images with arbitrary
orientation. In this scenario, the oriented directions of objects vary considerably across …
Not all patches are what you need: Expediting vision transformers via token reorganizations
Vision Transformers (ViTs) take all the image patches as tokens and construct multi-head
self-attention (MHSA) among them. Complete leverage of these image tokens brings …
FlexiViT: One model for all patch sizes
Vision Transformers convert images to sequences by slicing them into patches. The size of
these patches controls a speed/accuracy tradeoff, with smaller patches leading to higher …
AdaViT: Adaptive vision transformers for efficient image recognition
Built on top of self-attention mechanisms, vision transformers have demonstrated
remarkable performance on a variety of vision tasks recently. While achieving excellent …