A review of deep learning for video captioning

M Abdar, M Kollati, S Kuraparthi… - … on Pattern Analysis …, 2024 - ieeexplore.ieee.org
Video captioning (VC) is a fast-moving, cross-disciplinary area of research that comprises
contributions from domains such as computer vision, natural language processing …

Vid2seq: Large-scale pretraining of a visual language model for dense video captioning

A Yang, A Nagrani, PH Seo, A Miech… - Proceedings of the …, 2023 - openaccess.thecvf.com
In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning
model pretrained on narrated videos, which are readily available at scale. The Vid2Seq …

Streaming dense video captioning

X Zhou, A Arnab, S Buch, S Yan… - Proceedings of the …, 2024 - openaccess.thecvf.com
An ideal model for dense video captioning--predicting captions localized temporally in a
video--should be able to handle long input videos and predict rich, detailed textual descriptions …

End-to-end dense video captioning with parallel decoding

T Wang, R Zhang, Z Lu, F Zheng… - Proceedings of the …, 2021 - openaccess.thecvf.com
Dense video captioning aims to generate multiple associated captions with their temporal
locations from the video. Previous methods follow a sophisticated "localize-then-describe" …

Deep learning and knowledge graph for image/video captioning: A review of datasets, evaluation metrics, and methods

MS Wajid, H Terashima‐Marin, P Najafirad… - Engineering …, 2024 - Wiley Online Library
Generating an image/video caption has always been a fundamental problem of Artificial
Intelligence, one that is usually addressed with deep learning methods …

Text with knowledge graph augmented transformer for video captioning

X Gu, G Chen, Y Wang, L Zhang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Video captioning aims to describe the content of videos using natural language. Although
significant progress has been made, there is still much room to improve the performance for …

Crossclr: Cross-modal contrastive learning for multi-modal video representations

M Zolfaghari, Y Zhu, P Gehler… - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com
Contrastive learning allows us to flexibly define powerful losses by contrasting positive pairs
from sets of negative samples. Recently, the principle has also been used to learn cross …
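
The CrossCLR snippet describes the contrastive principle: pull matching (positive) pairs together while pushing them apart from a set of negatives. As an illustration only, here is a minimal InfoNCE-style cross-modal contrastive loss in NumPy; the function name, temperature value, and toy embeddings are assumptions for the sketch and do not reproduce the CrossCLR method itself:

```python
import numpy as np

def info_nce_loss(video_emb, text_emb, temperature=0.07):
    """InfoNCE-style loss: the video-text pair at the same batch index is the
    positive; every other pairing in the batch serves as a negative."""
    # L2-normalize so the dot product is a cosine similarity.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature  # (batch, batch) similarity matrix
    # Row-wise log-softmax, then cross-entropy against the diagonal
    # (each row's matching pair).
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

# Toy batch of 4 paired embeddings of dimension 8.
rng = np.random.default_rng(0)
v = rng.normal(size=(4, 8))
loss_random = info_nce_loss(v, rng.normal(size=(4, 8)))  # unrelated text
loss_aligned = info_nce_loss(v, v)                        # matching pairs
print(round(float(loss_aligned), 4), round(float(loss_random), 4))
```

With aligned pairs the diagonal dominates the similarity matrix and the loss approaches zero; with unrelated embeddings it stays near the log of the batch size.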

AAP-MIT: Attentive atrous pyramid network and memory incorporated transformer for multisentence video description

J Prudviraj, MI Reddy, C Vishnu… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
Generating multi-sentence descriptions for video is considered one of the most complex tasks
in computer vision and natural language understanding due to the intricate nature of video …

Coot: Cooperative hierarchical transformer for video-text representation learning

S Ging, M Zolfaghari, H Pirsiavash… - Advances in neural …, 2020 - proceedings.neurips.cc
Many real-world video-text tasks involve different levels of granularity, such as frames and
words, clips and sentences, or videos and paragraphs, each with distinct semantics. In this …

Multi-modal dense video captioning

V Iashin, E Rahtu - … of the IEEE/CVF conference on …, 2020 - openaccess.thecvf.com
Dense video captioning is the task of localizing interesting events in an untrimmed video
and producing a textual description (caption) for each localized event. Most of the previous …