- Academic Search

E Vahdani, Y Tian - IEEE Transactions on Pattern Analysis and …, 2022 - ieeexplore.ieee.org

Understanding human behavior and activity facilitates advancement of numerous real-world
applications, and is critical for video analysis. Despite the progress of action recognition …

Cited by 75

[PDF] thecvf.com

Vid2seq: Large-scale pretraining of a visual language model for dense video captioning

A Yang, A Nagrani, PH Seo, A Miech… - Proceedings of the …, 2023 - openaccess.thecvf.com

In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning
model pretrained on narrated videos which are readily-available at scale. The Vid2Seq …

Cited by 238

[PDF] arxiv.org

Temporal sentence grounding in videos: A survey and future directions

H Zhang, A Sun, W Jing, JT Zhou - IEEE Transactions on …, 2023 - ieeexplore.ieee.org

Temporal sentence grounding in videos (TSGV), aka, natural language video localization
(NLVL) or video moment retrieval (VMR), aims to retrieve a temporal moment that …

Cited by 53

[PDF] arxiv.org

Towards Adaptive Pseudo-label Learning for Semi-Supervised Temporal Action Localization

F Zhou, B Williams, H Rahmani - European Conference on Computer …, 2024 - Springer

Alleviating noisy pseudo labels remains a key challenge in Semi-Supervised Temporal
Action Localization (SS-TAL). Existing methods often filter pseudo labels based on strict …

[PDF] arxiv.org

BiLL-VTG: Bridging Large Language Models and Lightweight Visual Tools for Video-based Texts Generation

J Qi, K Ji, J Yu, D Wang, B Xu, L Hou, J Li - ar**, R Basri… - The Thirty-eighth Annual … - openreview.net

The recent emergence of powerful Vision-Language models (VLMs) has significantly
improved image captioning. Some of these models are extended to caption videos as well …

[PDF] ssrn.com

Vidcap-Llm: Vision-Transformer and Large Language Model for Video Captioning with Linguistic Semantics Integration

A Tariq, M Elhadef, MU Ghani Khan - Available at SSRN 4812289 - papers.ssrn.com

Video captioning models produce textual descriptions based on content, emphasizing the
pivotal role of representation learning. Conventional methods are primarily designed within …

Saved in My Library

Contrastive language-action pre-training for temporal localization

Deep learning-based action detection in untrimmed videos: A survey
