Self-supervised learning for videos: A survey
The remarkable success of deep learning in various domains relies on the availability of
large-scale annotated datasets. However, obtaining annotations is expensive and requires …
SkeletonMAE: Graph-based masked autoencoder for skeleton sequence pre-training
Skeleton sequence representation learning has shown great advantages for action
recognition due to its promising ability to model human joints and topology. However, the …
Fine-grained temporal contrastive learning for weakly-supervised temporal action localization
We target at the task of weakly-supervised action localization (WSAL), where only video-
level action labels are available during model training. Despite the recent progress, existing …
Video-mined task graphs for keystep recognition in instructional videos
Procedural activity understanding requires perceiving human actions in terms of a broader
task, where multiple keysteps are performed in sequence across a long video to reach a …
Learning to predict activity progress by self-supervised video alignment
In this paper we tackle the problem of self-supervised video alignment and activity progress
prediction using in-the-wild videos. Our proposed self-supervised representation learning …
Progress-aware online action segmentation for egocentric procedural task videos
We address the problem of online action segmentation for egocentric procedural task
videos. While previous studies have mostly focused on offline action segmentation where …
StepFormer: Self-supervised step discovery and localization in instructional videos
Instructional videos are an important resource to learn procedural tasks from human
demonstrations. However, the instruction steps in such videos are typically short and sparse …
Learning fine-grained view-invariant representations from unpaired ego-exo videos via temporal alignment
The egocentric and exocentric viewpoints of a human activity look dramatically different, yet
invariant representations to link them are essential for many potential applications in …
Drop-DTW: Aligning common signal between sequences while dropping outliers
In this work, we consider the problem of sequence-to-sequence alignment for signals
containing outliers. Assuming the absence of outliers, the standard Dynamic Time Warping …
Frame-wise action representations for long videos via sequence contrastive learning
Prior works on action representation learning mainly focus on designing various
architectures to extract the global representations for short video clips. In contrast, many …