A survey on video moment localization

M Liu, L Nie, Y Wang, M Wang, Y Rui - ACM Computing Surveys, 2023 - dl.acm.org
Video moment localization, also known as video moment retrieval, aims to search a target
segment within a video described by a given natural language query. Beyond the task of …

Egocentric video-language pretraining

KQ Lin, J Wang, M Soldan, M Wray… - Advances in …, 2022 - proceedings.neurips.cc
Video-Language Pretraining (VLP), which aims to learn transferable representation
to advance a wide range of video-text downstream tasks, has recently received increasing …

Revisiting the "video" in video-language understanding

S Buch, C Eyzaguirre, A Gaidon, J Wu… - Proceedings of the …, 2022 - openaccess.thecvf.com
What makes a video task uniquely suited for videos, beyond what can be understood from a
single image? Building on recent progress in self-supervised image-language models, we …

EgoVLPv2: Egocentric video-language pre-training with fusion in the backbone

S Pramanick, Y Song, S Nag, KQ Lin… - Proceedings of the …, 2023 - openaccess.thecvf.com
Video-language pre-training (VLP) has become increasingly important due to its ability to
generalize to various vision and language tasks. However, existing egocentric VLP …

Temporal sentence grounding in videos: A survey and future directions

H Zhang, A Sun, W Jing, JT Zhou - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Temporal sentence grounding in videos (TSGV), aka, natural language video localization
(NLVL) or video moment retrieval (VMR), aims to retrieve a temporal moment that …

Support-set bottlenecks for video-text representation learning

M Patrick, PY Huang, Y Asano, F Metze… - arXiv preprint arXiv …, 2020 - arxiv.org
The dominant paradigm for learning video-text representations--noise contrastive learning--
increases the similarity of the representations of pairs of samples that are known to be …

Learning 2d temporal adjacent networks for moment localization with natural language

S Zhang, H Peng, J Fu, J Luo - Proceedings of the AAAI Conference on …, 2020 - ojs.aaai.org
We address the problem of retrieving a specific moment from an untrimmed video by a query
sentence. This is a challenging problem because a target moment may take place in …

UnLoc: A unified framework for video localization tasks

S Yan, X Xiong, A Nagrani, A Arnab… - Proceedings of the …, 2023 - openaccess.thecvf.com
While large-scale image-text pretrained models such as CLIP have been used for multiple
video-level tasks on trimmed videos, their use for temporal localization in untrimmed videos …

TubeDETR: Spatio-temporal video grounding with transformers

A Yang, A Miech, J Sivic, I Laptev… - Proceedings of the …, 2022 - openaccess.thecvf.com
We consider the problem of localizing a spatio-temporal tube in a video corresponding to a
given text query. This is a challenging task that requires the joint and efficient modeling of …

VidChapters-7M: Video chapters at scale

A Yang, A Nagrani, I Laptev, J Sivic… - Advances in Neural …, 2024 - proceedings.neurips.cc
Segmenting untrimmed videos into chapters enables users to quickly navigate to the
information of their interest. This important topic has been understudied due to the lack of …