Deep learning-based action detection in untrimmed videos: A survey
Understanding human behavior and activity facilitates the advancement of numerous real-world
applications, and is critical for video analysis. Despite the progress of action recognition …
ActionFormer: Localizing moments of actions with transformers
Self-attention based Transformer models have demonstrated impressive results for image
classification and object detection, and more recently for video understanding. Inspired by …
UniVTG: Towards unified video-language temporal grounding
Video Temporal Grounding (VTG), which aims to ground target clips from videos
(such as consecutive intervals or disjoint shots) according to custom language queries (e.g., …
Temporal sentence grounding in videos: A survey and future directions
Temporal sentence grounding in videos (TSGV), a.k.a. natural language video localization
(NLVL) or video moment retrieval (VMR), aims to retrieve a temporal moment that …
Video self-stitching graph network for temporal action localization
Temporal action localization (TAL) in videos is a challenging task, especially due to the
large variation in action temporal scales. Short actions usually occupy a major proportion in …
An empirical study of end-to-end temporal action detection
Temporal action detection (TAD) is an important yet challenging task in video
understanding. It aims to simultaneously predict the semantic label and the temporal interval …
LocVTP: Video-text pre-training for temporal localization
Video-Text Pre-training (VTP) aims to learn transferable representations for various
downstream tasks from large-scale web videos. To date, almost all existing VTP methods …
Zero-shot temporal action detection via vision-language prompting
Existing temporal action detection (TAD) methods rely on large training data including
segment-level annotations, limited to recognizing previously seen classes alone during …
Cross-modal consensus network for weakly supervised temporal action localization
Weakly supervised temporal action localization (WS-TAL) is a challenging task that aims to
localize action instances in the given video with video-level categorical supervision …
Proposal-free temporal action detection via global segmentation mask learning
Existing temporal action detection (TAD) methods rely on generating an overwhelmingly
large number of proposals per video. This leads to complex model designs due to proposal …