A survey on video moment localization
Video moment localization, also known as video moment retrieval, aims to search a target
segment within a video described by a given natural language query. Beyond the task of …
Egocentric video-language pretraining
Abstract Video-Language Pretraining (VLP), which aims to learn transferable representation
to advance a wide range of video-text downstream tasks, has recently received increasing …
Revisiting the "video" in video-language understanding
What makes a video task uniquely suited for videos, beyond what can be understood from a
single image? Building on recent progress in self-supervised image-language models, we …
EgoVLPv2: Egocentric video-language pre-training with fusion in the backbone
Video-language pre-training (VLP) has become increasingly important due to its ability to
generalize to various vision and language tasks. However, existing egocentric VLP …
Temporal sentence grounding in videos: A survey and future directions
Temporal sentence grounding in videos (TSGV), aka, natural language video localization
(NLVL) or video moment retrieval (VMR), aims to retrieve a temporal moment that …
Support-set bottlenecks for video-text representation learning
The dominant paradigm for learning video-text representations--noise contrastive learning--
increases the similarity of the representations of pairs of samples that are known to be …
Learning 2D temporal adjacent networks for moment localization with natural language
We address the problem of retrieving a specific moment from an untrimmed video by a query
sentence. This is a challenging problem because a target moment may take place in …
UnLoc: A unified framework for video localization tasks
While large-scale image-text pretrained models such as CLIP have been used for multiple
video-level tasks on trimmed videos, their use for temporal localization in untrimmed videos …
TubeDETR: Spatio-temporal video grounding with transformers
We consider the problem of localizing a spatio-temporal tube in a video corresponding to a
given text query. This is a challenging task that requires the joint and efficient modeling of …
VidChapters-7M: Video chapters at scale
Segmenting untrimmed videos into chapters enables users to quickly navigate to the
information of their interest. This important topic has been understudied due to the lack of …