Multi-modal dense video captioning
Dense video captioning is a task of localizing interesting events from an untrimmed video
and producing textual descriptions (captions) for each localized event. Most of the previous …
A better use of audio-visual cues: Dense video captioning with bi-modal transformer
Dense video captioning aims to localize and describe important events in untrimmed videos.
Existing methods mainly tackle this task by exploiting only visual features, while completely …
Watch, listen and tell: Multi-modal weakly supervised dense event captioning
Multi-modal learning, particularly among imaging and linguistic modalities, has made
amazing strides in many high-level fundamental visual understanding problems, ranging …
Temporal deformable convolutional encoder-decoder networks for video captioning
It is well believed that video captioning is a fundamental but challenging task in both
computer vision and artificial intelligence fields. The prevalent approach is to map an input …
Language model agnostic gray-box adversarial attack on image captioning
Adversarial susceptibility of neural image captioning is still under-explored due to the
complex multi-modal nature of the task. We introduce a GAN-based adversarial attack to …
TAVT: Towards Transferable Audio-Visual Text Generation
Audio-visual text generation aims to understand multi-modal content and translate it
into text. Although various transfer learning techniques for text generation have been …
Semantic similarity on multimodal data: A comprehensive survey with applications
Recently, the revival of the semantic similarity concept has been featured by the rapidly
growing artificial intelligence research fueled by advanced deep learning architectures …
Dense video captioning with early linguistic information fusion
Dense captioning methods generally detect events in videos first and then generate
captions for the individual events. Events are localized solely based on the visual cues while …
Deep reinforcement polishing network for video captioning
W Xu, J Yu, Z Miao, L Wan, Y Tian et al., IEEE Transactions on …, 2020
The video captioning task aims to describe video content using several natural-language
sentences. Although one-step encoder-decoder models have achieved promising progress …
I2Transformer: Intra- and Inter-Relation Embedding Transformer for TV Show Captioning
TV show captioning aims to generate a linguistic sentence based on the video and its
associated subtitle. Compared to purely video-based captioning, the subtitle can provide the …