A survey on video moment localization
Video moment localization, also known as video moment retrieval, aims to search for a target
segment within a video described by a given natural language query. Beyond the task of …
Temporal sentence grounding in videos: A survey and future directions
Temporal sentence grounding in videos (TSGV), also known as natural language video localization
(NLVL) or video moment retrieval (VMR), aims to retrieve a temporal moment that …
Deconfounded video moment retrieval with causal intervention
We tackle the task of video moment retrieval (VMR), which aims to localize a specific
moment in a video according to a textual query. Existing methods primarily model the …
Context-aware biaffine localizing network for temporal sentence grounding
This paper addresses the problem of temporal sentence grounding (TSG), which aims to
identify the temporal boundary of a specific segment from an untrimmed video by a sentence …
MAD: A scalable dataset for language grounding in videos from movie audio descriptions
The recent and increasing interest in video-language research has driven the development
of large-scale datasets that enable data-intensive machine learning techniques. In …
Fast video moment retrieval
This paper targets fast video moment retrieval (fast VMR), aiming to localize the target
moment efficiently and accurately as queried by a given natural language sentence. We …
Knowing where to focus: Event-aware transformer for video grounding
Recent DETR-based video grounding models directly predict moment
timestamps without any hand-crafted components, such as a pre-defined proposal or non …
You can ground earlier than see: An effective and efficient pipeline for temporal sentence grounding in compressed videos
Given an untrimmed video, temporal sentence grounding (TSG) aims to locate a target
moment semantically according to a sentence query. Although previous respectable works …
G2L: Semantically aligned and uniform video grounding via geodesic and game theory
Recent video grounding works attempt to introduce vanilla contrastive learning into video
grounding. However, we claim that this naive solution is suboptimal. Contrastive learning …
LocVTP: Video-text pre-training for temporal localization
Video-Text Pre-training (VTP) aims to learn transferable representations for various
downstream tasks from large-scale web videos. To date, almost all existing VTP methods …