A review of multimodal human activity recognition with special emphasis on classification, applications, challenges and future directions

SK Yadav, K Tiwari, HM Pandey, SA Akbar - Knowledge-Based Systems, 2021 - Elsevier
Human activity recognition (HAR) is one of the most important and challenging problems in
computer vision. It has critical applications in a wide variety of tasks, including gaming …

Temporal action segmentation: An analysis of modern techniques

G Ding, F Sener, A Yao - IEEE Transactions on Pattern Analysis …, 2023 - ieeexplore.ieee.org
Temporal action segmentation (TAS) in videos aims at densely identifying video frames in
minutes-long videos with multiple action classes. As a long-range video understanding task …

Embodiedgpt: Vision-language pre-training via embodied chain of thought

Y Mu, Q Zhang, M Hu, W Wang… - Advances in …, 2023 - proceedings.neurips.cc
Embodied AI is a crucial frontier in robotics, capable of planning and executing action
sequences for robots to accomplish long-horizon tasks in physical environments. In this …

Affordances from human videos as a versatile representation for robotics

S Bahl, R Mendonca, L Chen… - Proceedings of the …, 2023 - openaccess.thecvf.com
Building a robot that can understand and learn to interact by watching humans has inspired
several vision problems. However, despite some successful results on static datasets, it …

Ego4d: Around the world in 3,000 hours of egocentric video

K Grauman, A Westbury, E Byrne… - Proceedings of the …, 2022 - openaccess.thecvf.com
We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It
offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household …

Egocentric video-language pretraining

KQ Lin, J Wang, M Soldan, M Wray… - Advances in …, 2022 - proceedings.neurips.cc
Video-Language Pretraining (VLP), which aims to learn transferable representation
to advance a wide range of video-text downstream tasks, has recently received increasing …

Anticipative video transformer

R Girdhar, K Grauman - Proceedings of the IEEE/CVF …, 2021 - openaccess.thecvf.com
We propose Anticipative Video Transformer (AVT), an end-to-end attention-based
video modeling architecture that attends to the previously observed video in order to …

Future transformer for long-term action anticipation

D Gong, J Lee, M Kim, SJ Ha… - Proceedings of the IEEE …, 2022 - openaccess.thecvf.com
The task of predicting future actions from a video is crucial for a real-world agent interacting
with others. When anticipating actions in the distant future, we humans typically consider …

Learning video representations using contrastive bidirectional transformer

C Sun, F Baradel, K Murphy, C Schmid - arXiv preprint arXiv:1906.05743, 2019 - arxiv.org
This paper proposes a self-supervised learning approach for video features that results in
significantly improved performance on downstream tasks (such as video classification …

Can't make an Omelette without Breaking some Eggs: Plausible Action Anticipation using Large Video-Language Models

H Mittal, N Agarwal, SY Lo… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
We introduce PlausiVL, a large video-language model for anticipating action sequences that
are plausible in the real world. While significant efforts have been made towards anticipating …