A review of deep learning for video captioning

M Abdar, M Kollati, S Kuraparthi… - … on Pattern Analysis …, 2024 - ieeexplore.ieee.org
Video captioning (VC) is a fast-moving, cross-disciplinary area of research that comprises
contributions from domains such as computer vision, natural language processing …

Video description: A comprehensive survey of deep learning approaches

G Rafiq, M Rafiq, GS Choi - Artificial Intelligence Review, 2023 - Springer
Video description refers to understanding visual content and transforming that acquired
understanding into automatic textual narration. It bridges the key AI fields of computer vision …

ADAPT: Action-aware driving caption transformer

B Jin, X Liu, Y Zheng, P Li, H Zhao… - … on Robotics and …, 2023 - ieeexplore.ieee.org
End-to-end autonomous driving has great potential in the transportation industry. However,
the lack of transparency and interpretability of the automatic decision-making process …

Video captioning using global-local representation

L Yan, S Ma, Q Wang, Y Chen, X Zhang… - … on Circuits and …, 2022 - ieeexplore.ieee.org
Video captioning is a challenging task as it needs to accurately transform visual
understanding into natural language description. To date, state-of-the-art methods …

Exploring group video captioning with efficient relational approximation

W Lin, T Jin, Y Wang, W Pan, L Li… - Proceedings of the …, 2023 - openaccess.thecvf.com
Current video captioning efforts mostly focus on describing a single video while the need for
captioning videos in groups has increased considerably. In this study, we propose a new …

Refined semantic enhancement towards frequency diffusion for video captioning

X Zhong, Z Li, S Chen, K Jiang, C Chen… - Proceedings of the AAAI …, 2023 - ojs.aaai.org
Video captioning aims to generate natural language sentences that describe the given video
accurately. Existing methods obtain favorable generation by exploring richer visual …

TAVT: Towards Transferable Audio-Visual Text Generation

W Lin, T Jin, W Pan, L Li, X Cheng… - Proceedings of the …, 2023 - aclanthology.org
Audio-visual text generation aims to understand multi-modality contents and translate them
into texts. Although various transfer learning techniques of text generation have been …

Dyadformer: A multi-modal transformer for long-range modeling of dyadic interactions

D Curto, A Clapés, J Selva… - Proceedings of the …, 2021 - openaccess.thecvf.com
Personality computing has become an emerging topic in computer vision, due to the wide
range of applications it can be used for. However, most works on the topic have focused on …

Evolution of visual data captioning Methods, Datasets, and evaluation Metrics: A comprehensive survey

D Sharma, C Dhiman, D Kumar - Expert Systems with Applications, 2023 - Elsevier
Automatic Visual Captioning (AVC) generates syntactically and semantically correct
sentences by describing important objects, attributes, and their relationships with each other …

Deep learning and knowledge graph for image/video captioning: A review of datasets, evaluation metrics, and methods

MS Wajid, H Terashima‐Marin, P Najafirad… - Engineering …, 2024 - Wiley Online Library
Generating an image/video caption has always been a fundamental problem of Artificial
Intelligence, which is usually performed using the potential of Deep Learning Methods …