From show to tell: A survey on deep learning-based image captioning

M Stefanini, M Cornia, L Baraldi… - IEEE transactions on …, 2022 - ieeexplore.ieee.org
Connecting Vision and Language plays an essential role in Generative Intelligence. For this
reason, large research efforts have been devoted to image captioning, i.e. describing images …

A comprehensive survey of deep learning for image captioning

MDZ Hossain, F Sohel, MF Shiratuddin… - ACM Computing Surveys …, 2019 - dl.acm.org
Generating a description of an image is called image captioning. Image captioning requires
recognizing the important objects, their attributes, and their relationships in an image. It also …

Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning

P Sharma, N Ding, S Goodman… - Proceedings of the 56th …, 2018 - aclanthology.org
We present a new dataset of image caption annotations, Conceptual Captions, which
contains an order of magnitude more images than the MS-COCO dataset (Lin et al., 2014) …

Deep multimodal representation learning: A survey

W Guo, J Wang, S Wang - IEEE Access, 2019 - ieeexplore.ieee.org
Multimodal representation learning, which aims to narrow the heterogeneity gap among
different modalities, plays an indispensable role in the utilization of ubiquitous multimodal …

Multimodal machine learning: A survey and taxonomy

T Baltrušaitis, C Ahuja… - IEEE transactions on …, 2018 - ieeexplore.ieee.org
Our experience of the world is multimodal - we see objects, hear sounds, feel texture, smell
odors, and taste flavors. Modality refers to the way in which something happens or is …

Survey of the state of the art in natural language generation: Core tasks, applications and evaluation

A Gatt, E Krahmer - Journal of Artificial Intelligence Research, 2018 - jair.org
This paper surveys the current state of the art in Natural Language Generation (NLG),
defined as the task of generating text or speech from non-linguistic input. A survey of NLG is …

SPICE: Semantic propositional image caption evaluation

P Anderson, B Fernando, M Johnson… - Computer Vision–ECCV …, 2016 - Springer
There is considerable interest in the task of automatically generating image captions.
However, evaluation is challenging. Existing automatic evaluation metrics are primarily …

Remind your neural network to prevent catastrophic forgetting

TL Hayes, K Kafle, R Shrestha, M Acharya… - European conference on …, 2020 - Springer
People learn throughout life. However, incrementally updating conventional neural networks
leads to catastrophic forgetting. A common remedy is replay, which is inspired by how the …

What makes training multi-modal classification networks hard?

W Wang, D Tran, M Feiszli - … of the IEEE/CVF conference on …, 2020 - openaccess.thecvf.com
Consider end-to-end training of a multi-modal vs. a uni-modal network on a task with
multiple input modalities: the multi-modal network receives more information, so it should …

Visual translation embedding network for visual relation detection

H Zhang, Z Kyaw, SF Chang… - Proceedings of the IEEE …, 2017 - openaccess.thecvf.com
Visual relations, such as "person ride bike" and "bike next to car", offer a comprehensive
scene understanding of an image, and have already shown their great utility in connecting …