Transformers in vision: A survey

S Khan, M Naseer, M Hayat, SW Zamir… - ACM computing …, 2022 - dl.acm.org
Astounding results from Transformer models on natural language tasks have intrigued the
vision community to study their application to computer vision problems. Among their salient …

Human action recognition from various data modalities: A review

Z Sun, Q Ke, H Rahmani, M Bennamoun… - IEEE transactions on …, 2022 - ieeexplore.ieee.org
Human Action Recognition (HAR) aims to understand human behavior and assign a label to
each action. It has a wide range of applications, and therefore has been attracting increasing …

A ConvNet for the 2020s

Z Liu, H Mao, CY Wu, C Feichtenhofer… - Proceedings of the …, 2022 - openaccess.thecvf.com
The "Roaring 20s" of visual recognition began with the introduction of Vision Transformers
(ViTs), which quickly superseded ConvNets as the state-of-the-art image classification …

ConvNeXt V2: Co-designing and scaling ConvNets with masked autoencoders

S Woo, S Debnath, R Hu, X Chen… - Proceedings of the …, 2023 - openaccess.thecvf.com
Driven by improved architectures and better representation learning frameworks, the field of
visual recognition has enjoyed rapid modernization and performance boost in the early …

VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training

Z Tong, Y Song, J Wang… - Advances in neural …, 2022 - proceedings.neurips.cc
Pre-training video transformers on extra large-scale datasets is generally required to
achieve premier performance on relatively small datasets. In this paper, we show that video …

Exploring plain vision transformer backbones for object detection

Y Li, H Mao, R Girshick, K He - European conference on computer vision, 2022 - Springer
We explore the plain, non-hierarchical Vision Transformer (ViT) as a backbone network for
object detection. This design enables the original ViT architecture to be fine-tuned for object …

Video Swin Transformer

Z Liu, J Ning, Y Cao, Y Wei, Z Zhang… - Proceedings of the …, 2022 - openaccess.thecvf.com
The vision community is witnessing a modeling shift from CNNs to Transformers, where pure
Transformer architectures have attained top accuracy on the major video recognition …

Masked autoencoders as spatiotemporal learners

C Feichtenhofer, Y Li, K He - Advances in neural …, 2022 - proceedings.neurips.cc
This paper studies a conceptually simple extension of Masked Autoencoders (MAE) to
spatiotemporal representation learning from videos. We randomly mask out spacetime …

A survey on vision transformer

K Han, Y Wang, H Chen, X Chen, J Guo… - IEEE transactions on …, 2022 - ieeexplore.ieee.org
Transformer, first applied to the field of natural language processing, is a type of deep neural
network mainly based on the self-attention mechanism. Thanks to its strong representation …

CSWin Transformer: A general vision transformer backbone with cross-shaped windows

X Dong, J Bao, D Chen, W Zhang… - Proceedings of the …, 2022 - openaccess.thecvf.com
We present CSWin Transformer, an efficient and effective Transformer-based
backbone for general-purpose vision tasks. A challenging issue in Transformer design is …