Convolutional neural networks or vision transformers: Who will win the race for action recognitions in visual data?

O Moutik, H Sekkat, S Tigani, A Chehri, R Saadane… - Sensors, 2023 - mdpi.com
Understanding actions in videos remains a significant challenge in computer vision and has been the subject of extensive research over the past decades. Convolutional neural …

Attention bottlenecks for multimodal fusion

A Nagrani, S Yang, A Arnab, A Jansen… - Advances in neural …, 2021 - proceedings.neurips.cc
Humans perceive the world by concurrently processing and fusing high-dimensional inputs
from multiple modalities such as vision and audio. Machine perception models, in stark …

Video Swin transformer

Z Liu, J Ning, Y Cao, Y Wei, Z Zhang… - Proceedings of the …, 2022 - openaccess.thecvf.com
The vision community is witnessing a modeling shift from CNNs to Transformers, where pure
Transformer architectures have attained top accuracy on the major video recognition …

ViViT: A video vision transformer

A Arnab, M Dehghani, G Heigold… - Proceedings of the …, 2021 - openaccess.thecvf.com
We present pure-transformer based models for video classification, drawing upon the recent
success of such models in image classification. Our model extracts spatio-temporal tokens …
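The snippet's "extracts spatio-temporal tokens" can be pictured as a tubelet-style 3D patch embedding. The PyTorch sketch below is only an illustration under assumed settings; the class name, tubelet size (2x16x16), and embedding dimension (768) are hypothetical, not the paper's exact configuration.

import torch
import torch.nn as nn

class TubeletEmbedding(nn.Module):
    """Toy spatio-temporal tokenizer: non-overlapping 3D patches -> token sequence."""
    def __init__(self, in_ch=3, embed_dim=768, tubelet=(2, 16, 16)):
        super().__init__()
        # A 3D conv with stride equal to its kernel cuts the clip into tubelets
        # and linearly projects each one to an embedding vector.
        self.proj = nn.Conv3d(in_ch, embed_dim, kernel_size=tubelet, stride=tubelet)

    def forward(self, video):                 # video: (B, C, T, H, W)
        x = self.proj(video)                  # (B, D, T', H', W')
        return x.flatten(2).transpose(1, 2)   # (B, num_tokens, D)

tokens = TubeletEmbedding()(torch.randn(1, 3, 16, 224, 224))
print(tokens.shape)                           # -> torch.Size([1, 1568, 768])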

CrossViT: Cross-attention multi-scale vision transformer for image classification

CFR Chen, Q Fan, R Panda - Proceedings of the IEEE/CVF …, 2021 - openaccess.thecvf.com
The recently developed vision transformer (ViT) has achieved promising results on image
classification compared to convolutional neural networks. Inspired by this, in this paper, we …

VATT: Transformers for multimodal self-supervised learning from raw video, audio and text

H Akbari, L Yuan, R Qian… - Advances in …, 2021 - proceedings.neurips.cc
We present a framework for learning multimodal representations from unlabeled data using
convolution-free Transformer architectures. Specifically, our Video-Audio-Text Transformer …

Is space-time attention all you need for video understanding?

G Bertasius, H Wang, L Torresani - ICML, 2021 - proceedings.mlr.press
Training. We train our model for 15 epochs with an initial learning rate of 0.005, which is divided by 10 at epochs 11 and 14. During training, we first resize the shorter side of the …
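The training recipe in this snippet (15 epochs, base learning rate 0.005, divided by 10 at epochs 11 and 14) maps directly onto a step schedule. A minimal PyTorch sketch follows; the model, momentum value, and training loop are placeholders, not the paper's code.

import torch

model = torch.nn.Linear(768, 400)   # placeholder for the actual video model
optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)
# Divide the learning rate by 10 at epochs 11 and 14, over 15 epochs total.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[11, 14], gamma=0.1)

for epoch in range(15):
    # ... one training pass over the dataset would go here ...
    scheduler.step()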

BEVT: BERT pretraining of video transformers

R Wang, D Chen, Z Wu, Y Chen… - Proceedings of the …, 2022 - openaccess.thecvf.com
This paper studies the BERT pretraining of video transformers. It is a straightforward but
worth-studying extension given the recent success from BERT pretraining of image …

TDN: Temporal difference networks for efficient action recognition

L Wang, Z Tong, B Ji, G Wu - Proceedings of the IEEE/CVF …, 2021 - openaccess.thecvf.com
Temporal modeling remains challenging for action recognition in videos. To mitigate this issue, this paper presents a new video architecture, termed Temporal Difference Network …
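The "temporal difference" idea named in the title, computing differences between nearby frames as a lightweight motion cue, can be illustrated with a toy PyTorch snippet. This is only a sketch of the general technique, not the paper's actual module, and the clip shape is assumed.

import torch

def temporal_difference(clip):
    # clip: (B, T, C, H, W); returns (B, T-1, C, H, W) of frame-to-frame RGB differences
    return clip[:, 1:] - clip[:, :-1]

clip = torch.randn(2, 8, 3, 112, 112)     # assumed batch of 8-frame clips
motion_cue = temporal_difference(clip)    # short-term motion signal alongside appearance
print(motion_cue.shape)                   # -> torch.Size([2, 7, 3, 112, 112])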

MoViNets: Mobile video networks for efficient video recognition

D Kondratyuk, L Yuan, Y Li, L Zhang… - Proceedings of the …, 2021 - openaccess.thecvf.com
We present Mobile Video Networks (MoViNets), a family of computation- and memory-efficient video networks that can operate on streaming video for online inference …