Human action recognition from various data modalities: A review

Z Sun, Q Ke, H Rahmani, M Bennamoun… - IEEE transactions on …, 2022 - ieeexplore.ieee.org
Human Action Recognition (HAR) aims to understand human behavior and assign a label to
each action. It has a wide range of applications, and therefore has been attracting increasing …

Vision transformers for action recognition: A survey

A Ulhaq, N Akhtar, G Pogrebna, A Mian - arXiv …
… and recognition are important components of visual scene understanding, e.g., for
object detection and semantic segmentation. With end-to-end deep learning systems …

Frozen CLIP models are efficient video learners

Z Lin, S Geng, R Zhang, P Gao, G De Melo… - … on Computer Vision, 2022 - Springer
Video recognition has been dominated by the end-to-end learning paradigm–first initializing
a video recognition model with weights of a pretrained image model and then conducting …

Masked feature prediction for self-supervised visual pre-training

C Wei, H Fan, S Xie, CY Wu, A Yuille… - Proceedings of the …, 2022 - openaccess.thecvf.com
We present Masked Feature Prediction (MaskFeat) for self-supervised pre-training
of video models. Our approach first randomly masks out a portion of the input sequence and …

Florence: A new foundation model for computer vision

L Yuan, D Chen, YL Chen, N Codella, X Dai… - arXiv preprint arXiv …, 2021 - arxiv.org
Automated visual understanding of our diverse and open world demands computer vision
models to generalize well with minimal customization for specific tasks, similar to human …

Multiview transformers for video recognition

S Yan, X Xiong, A Arnab, Z Lu… - Proceedings of the …, 2022 - openaccess.thecvf.com
Video understanding requires reasoning at multiple spatiotemporal resolutions--from short
fine-grained motions to events taking place over longer durations. Although transformer …

Open-world object manipulation using pre-trained vision-language models

A Stone, T Xiao, Y Lu, K Gopalakrishnan… - arXiv preprint arXiv …, 2023 - arxiv.org
For robots to follow instructions from people, they must be able to connect the rich semantic
information in human vocabulary, e.g. "can you get me the pink stuffed whale?" to their …