Temporal sentence grounding in videos: A survey and future directions

H Zhang, A Sun, W Jing, JT Zhou - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Temporal sentence grounding in videos (TSGV), a.k.a. natural language video localization
(NLVL) or video moment retrieval (VMR), aims to retrieve a temporal moment that …

Knowing where to focus: Event-aware transformer for video grounding

J Jang, J Park, J Kim, H Kwon… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Recent DETR-based video grounding models directly predict moment timestamps without
any hand-crafted components, such as a pre-defined proposal or non …

You can ground earlier than see: An effective and efficient pipeline for temporal sentence grounding in compressed videos

X Fang, D Liu, P Zhou, G Nan - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Given an untrimmed video, temporal sentence grounding (TSG) aims to semantically locate
a target moment according to a sentence query. Although previous respectable works …

Towards generalisable video moment retrieval: Visual-dynamic injection to image-text pre-training

D Luo, J Huang, S Gong, H Jin… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
The correlation between vision and text is essential for video moment retrieval (VMR);
however, existing methods heavily rely on separate pre-training feature extractors for visual …

Rethinking weakly-supervised video temporal grounding from a game perspective

X Fang, Z Xiong, W Fang, X Qu, C Chen, J Dong… - … on Computer Vision, 2024 - Springer
This paper addresses the challenging task of weakly-supervised video temporal grounding.
Existing approaches are generally based on the moment proposal selection framework that …

Fewer steps, better performance: Efficient cross-modal clip trimming for video moment retrieval using language

X Fang, D Liu, W Fang, P Zhou, Z Xu, W Xu… - Proceedings of the …, 2024 - ojs.aaai.org
Given an untrimmed video and a sentence query, video moment retrieval using language
(VMR) aims to locate a target query-relevant moment. Since the untrimmed video is …

HawkEye: Training video-text LLMs for grounding text in videos

Y Wang, X Meng, J Liang, Y Wang, Q Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
Video-text Large Language Models (video-text LLMs) have shown remarkable performance
in answering questions and holding conversations on simple videos. However, they perform …

AutoEval-Video: An automatic benchmark for assessing large vision-language models in open-ended video question answering

X Chen, Y Lin, Y Zhang, W Huang - European Conference on Computer …, 2024 - Springer
We propose a novel and challenging benchmark, AutoEval-Video, to comprehensively
evaluate large vision-language models in open-ended video question answering. The …

Dual learning with dynamic knowledge distillation for partially relevant video retrieval

J Dong, M Zhang, Z Zhang, X Chen… - Proceedings of the …, 2023 - openaccess.thecvf.com
Almost all previous text-to-video retrieval works assume that videos are pre-trimmed with
short durations. However, in practice, videos are generally untrimmed, containing much …

Partial annotation-based video moment retrieval via iterative learning

W Ji, R Liang, L Liao, H Fei, F Feng - Proceedings of the 31st ACM …, 2023 - dl.acm.org
Given a descriptive language query, Video Moment Retrieval (VMR) aims to seek the
corresponding semantically consistent moment clip in the video, which is represented as a pair …