Multi-modal dense video captioning

V Iashin, E Rahtu - … of the IEEE/CVF conference on …, 2020 - openaccess.thecvf.com
Dense video captioning is a task of localizing interesting events from an untrimmed video
and producing textual description (captions) for each localized event. Most of the previous …

A better use of audio-visual cues: Dense video captioning with bi-modal transformer

V Iashin, E Rahtu - arXiv preprint arXiv:2005.08271, 2020 - arxiv.org
Dense video captioning aims to localize and describe important events in untrimmed videos.
Existing methods mainly tackle this task by exploiting only visual features, while completely …

Watch, listen and tell: Multi-modal weakly supervised dense event captioning

T Rahman, B Xu, L Sigal - Proceedings of the IEEE/CVF …, 2019 - openaccess.thecvf.com
Multi-modal learning, particularly among imaging and linguistic modalities, has made
amazing strides in many high-level fundamental visual understanding problems, ranging …

Temporal deformable convolutional encoder-decoder networks for video captioning

J Chen, Y Pan, Y Li, T Yao, H Chao, T Mei - Proceedings of the AAAI …, 2019 - ojs.aaai.org
It is well believed that video captioning is a fundamental but challenging task in both
computer vision and artificial intelligence fields. The prevalent approach is to map an input …

Language model agnostic gray-box adversarial attack on image captioning

N Aafaq, N Akhtar, W Liu, M Shah… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
Adversarial susceptibility of neural image captioning is still under-explored due to the
complex multi-model nature of the task. We introduce a GAN-based adversarial attack to …

TAVT: Towards Transferable Audio-Visual Text Generation

W Lin, T Jin, W Pan, L Li, X Cheng… - Proceedings of the …, 2023 - aclanthology.org
Audio-visual text generation aims to understand multi-modality contents and translate them
into texts. Although various transfer learning techniques of text generation have been …

Semantic similarity on multimodal data: A comprehensive survey with applications

B Ihnaini, B Abuhaija, EA Mills… - Journal of King Saud …, 2024 - Elsevier
Recently, the revival of the semantic similarity concept has been featured by the rapidly
growing artificial intelligence research fueled by advanced deep learning architectures …

Dense video captioning with early linguistic information fusion

N Aafaq, A Mian, N Akhtar, W Liu… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
Dense captioning methods generally detect events in videos first and then generate
captions for the individual events. Events are localized solely based on the visual cues while …

Deep reinforcement polishing network for video captioning

W Xu, J Yu, Z Miao, L Wan, Y Tian… - IEEE Transactions on …, 2020 - ieeexplore.ieee.org
The video captioning task aims to describe video content using several natural-language
sentences. Although one-step encoder-decoder models have achieved promising progress …

I2Transformer: Intra- and Inter-Relation Embedding Transformer for TV Show Captioning

Y Tu, L Li, L Su, S Gao, C Yan, ZJ Zha… - … on Image Processing, 2022 - ieeexplore.ieee.org
TV show captioning aims to generate a linguistic sentence based on the video and its
associated subtitle. Compared to purely video-based captioning, the subtitle can provide the …