Deep learning-based action detection in untrimmed videos: A survey

E Vahdani, Y Tian - IEEE Transactions on Pattern Analysis and …, 2022 - ieeexplore.ieee.org
Understanding human behavior and activity facilitates advancement of numerous real-world
applications, and is critical for video analysis. Despite the progress of action recognition …

Vid2Seq: Large-scale pretraining of a visual language model for dense video captioning

A Yang, A Nagrani, PH Seo, A Miech… - Proceedings of the …, 2023 - openaccess.thecvf.com
In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning
model pretrained on narrated videos which are readily-available at scale. The Vid2Seq …

Temporal sentence grounding in videos: A survey and future directions

H Zhang, A Sun, W Jing, JT Zhou - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Temporal sentence grounding in videos (TSGV), aka, natural language video localization
(NLVL) or video moment retrieval (VMR), aims to retrieve a temporal moment that …

Towards Adaptive Pseudo-label Learning for Semi-Supervised Temporal Action Localization

F Zhou, B Williams, H Rahmani - European Conference on Computer …, 2024 - Springer
Alleviating noisy pseudo labels remains a key challenge in Semi-Supervised Temporal
Action Localization (SS-TAL). Existing methods often filter pseudo labels based on strict …

BiLL-VTG: Bridging Large Language Models and Lightweight Visual Tools for Video-based Texts Generation

J Qi, K Ji, J Yu, D Wang, B Xu, L Hou, J Li - arXiv …

…, R Basri… - The Thirty-eighth Annual … - openreview.net
The recent emergence of powerful Vision-Language models (VLMs) has significantly
improved image captioning. Some of these models are extended to caption videos as well …

VidCap-LLM: Vision-Transformer and Large Language Model for Video Captioning with Linguistic Semantics Integration

A Tariq, M Elhadef, MU Ghani Khan - Available at SSRN 4812289 - papers.ssrn.com
Video captioning models produce textual descriptions based on content, emphasizing the
pivotal role of representation learning. Conventional methods are primarily designed within …