A review of multimodal human activity recognition with special emphasis on classification, applications, challenges and future directions

SK Yadav, K Tiwari, HM Pandey, SA Akbar - Knowledge-Based Systems, 2021 - Elsevier
Human activity recognition (HAR) is one of the most important and challenging problems in
computer vision. It has critical applications in a wide variety of tasks including gaming …

Convolutional neural networks or vision transformers: Who will win the race for action recognitions in visual data?

O Moutik, H Sekkat, S Tigani, A Chehri, R Saadane… - Sensors, 2023 - mdpi.com
Understanding actions in videos remains a significant challenge in computer vision, which
has been the subject of several pieces of research in the last decades. Convolutional neural …

Videomae v2: Scaling video masked autoencoders with dual masking

L Wang, B Huang, Z Zhao, Z Tong… - Proceedings of the …, 2023 - openaccess.thecvf.com
Scale is the primary factor for building a powerful foundation model that could well
generalize to a variety of downstream tasks. However, it is still challenging to train video …

Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training

Z Tong, Y Song, J Wang… - Advances in neural …, 2022 - proceedings.neurips.cc
Pre-training video transformers on extra large-scale datasets is generally required to
achieve premier performance on relatively small datasets. In this paper, we show that video …

Uniformer: Unifying convolution and self-attention for visual recognition

K Li, Y Wang, J Zhang, P Gao, G Song… - … on Pattern Analysis …, 2023 - ieeexplore.ieee.org
It is a challenging task to learn discriminative representation from images and videos, due to
large local redundancy and complex global dependency in these visual data. Convolution …

Actionclip: A new paradigm for video action recognition

M Wang, J Xing, Y Liu - arxiv preprint arxiv:2109.08472, 2021 - arxiv.org
The canonical approach to video action recognition dictates a neural model to do a classic
and standard 1-of-N majority vote task. They are trained to predict a fixed set of predefined …

Tdn: Temporal difference networks for efficient action recognition

L Wang, Z Tong, B Ji, G Wu - Proceedings of the IEEE/CVF …, 2021 - openaccess.thecvf.com
Temporal modeling remains challenging for action recognition in videos. To mitigate this
issue, this paper presents a new video architecture, termed Temporal Difference Network …

Uniformer: Unified transformer for efficient spatiotemporal representation learning

K Li, Y Wang, P Gao, G Song, Y Liu, H Li… - arxiv preprint arxiv …, 2022 - arxiv.org
It is a challenging task to learn rich and multi-scale spatiotemporal semantics from high-
dimensional videos, due to large local redundancy and complex global dependency …

Bidirectional cross-modal knowledge exploration for video recognition with pre-trained vision-language models

W Wu, X Wang, H Luo, J Wang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Vision-language models (VLMs) pre-trained on large-scale image-text pairs have
demonstrated impressive transferability on various visual tasks. Transferring knowledge …

Vidtr: Video transformer without convolutions

Y Zhang, X Li, C Liu, B Shuai, Y Zhu… - Proceedings of the …, 2021 - openaccess.thecvf.com
We introduce Video Transformer (VidTr) with separable-attention for video
classification. Compared with commonly used 3D networks, VidTr is able to aggregate …