VideoMAE V2: Scaling video masked autoencoders with dual masking

L Wang, B Huang, Z Zhao, Z Tong… - Proceedings of the …, 2023 - openaccess.thecvf.com
Scale is the primary factor for building a powerful foundation model that could well
generalize to a variety of downstream tasks. However, it is still challenging to train video …

MGMAE: Motion guided masking for video masked autoencoding

B Huang, Z Zhao, G Zhang, Y Qiao… - Proceedings of the …, 2023 - openaccess.thecvf.com
Masked autoencoding has shown excellent performance on self-supervised video
representation learning. Temporal redundancy has led to a high masking ratio and …

Learning to predict activity progress by self-supervised video alignment

G Donahue, E Elhamifar - … of the IEEE/CVF Conference on …, 2024 - openaccess.thecvf.com
In this paper we tackle the problem of self-supervised video alignment and activity progress
prediction using in-the-wild videos. Our proposed self-supervised representation learning …

Masked modeling for self-supervised representation learning on vision and beyond

S Li, L Zhang, Z Wang, D Wu, L Wu, Z Liu, J Xia… - arXiv preprint arXiv …, 2023 - arxiv.org
As the deep learning revolution marches on, self-supervised learning has garnered
increasing attention in recent years thanks to its remarkable representation learning ability …

From static to dynamic: Adapting landmark-aware image models for facial expression recognition in videos

Y Chen, J Li, S Shan, M Wang… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
Dynamic facial expression recognition (DFER) in the wild is still hindered by data limitations,
e.g., insufficient quantity and diversity of pose, occlusion and illumination, as well as the …

Masked motion encoding for self-supervised video representation learning

X Sun, P Chen, L Chen, C Li, TH Li… - Proceedings of the …, 2023 - openaccess.thecvf.com
How to learn discriminative video representation from unlabeled videos is challenging but
crucial for video analysis. The latest attempts seek to learn a representation model by …

Masked autoencoders in computer vision: A comprehensive survey

Z Zhou, X Liu - IEEE Access, 2023 - ieeexplore.ieee.org
Masked autoencoders (MAE) is a deep learning method based on Transformer. Originally
used for images, it has now been extended to video, audio, and some other temporal …

Darkness-adaptive action recognition: Leveraging efficient tubelet slow-fast network for industrial applications

M Munsif, N Khan, A Hussain, MJ Kim… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
Infrared (IR) technology has emerged as a solution for monitoring dark environments. It
offers resilience to shifting illumination, appearance changes, and shadows, with …

Contextual visual and motion salient fusion framework for action recognition in dark environments

M Munsif, SU Khan, N Khan, A Hussain, MJ Kim… - Knowledge-Based …, 2024 - Elsevier
Infrared (IR) human action recognition (AR) exhibits resilience against shifting illumination
conditions, changes in appearance, and shadows. It has valuable applications in numerous …

AMS-Net: Modeling adaptive multi-granularity spatio-temporal cues for video action recognition

Q Wang, Q Hu, Z Gao, P Li, Q Hu - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Effective spatio-temporal modeling as a core of video representation learning is challenged
by complex scale variations in spatio-temporal cues in videos, especially different visual …