Temporal sentence grounding in videos: A survey and future directions
Temporal sentence grounding in videos (TSGV), a.k.a. natural language video localization
(NLVL) or video moment retrieval (VMR), aims to retrieve a temporal moment that …
Coot: Cooperative hierarchical transformer for video-text representation learning
Many real-world video-text tasks involve different levels of granularity, such as frames and
words, clips and sentences, or videos and paragraphs, each with distinct semantics. In this …
Weakly supervised temporal sentence grounding with gaussian-based contrastive proposal learning
Temporal sentence grounding aims to detect the most salient moment corresponding to the
natural language query from untrimmed videos. As labeling the temporal boundaries is labor …
Counterfactual contrastive learning for weakly-supervised vision-language grounding
Weakly-supervised vision-language grounding aims to localize a target moment in a video
or a specific region in an image according to the given sentence query, where only video …
Rethinking weakly-supervised video temporal grounding from a game perspective
This paper addresses the challenging task of weakly-supervised video temporal grounding.
Existing approaches are generally based on the moment proposal selection framework that …
Weakly supervised video moment localization with contrastive negative sample mining
Video moment localization aims at localizing the video segments which are most related to
the given free-form natural language query. The weakly supervised setting, where only …
Weakly supervised temporal sentence grounding with uncertainty-guided self-training
The task of weakly supervised temporal sentence grounding aims at finding the
corresponding temporal moments of a language description in the video, given video …
Cascaded prediction network via segment tree for temporal video grounding
Temporal video grounding aims to localize the target segment which is semantically aligned
with the given sentence in an untrimmed video. Existing methods can be divided into two …
Towards bridging event captioner and sentence localizer for weakly supervised dense event captioning
Dense Event Captioning (DEC) aims to jointly localize and describe multiple events
of interest in untrimmed videos, which is an advancement of the conventional video …
Learning video moment retrieval without a single annotated video
Video moment retrieval has progressed significantly over the past few years, aiming to
search for the moment that is most relevant to a given natural language query. Most existing …