From show to tell: A survey on deep learning-based image captioning

M Stefanini, M Cornia, L Baraldi… - IEEE transactions on …, 2022 - ieeexplore.ieee.org
Connecting Vision and Language plays an essential role in Generative Intelligence. For this
reason, large research efforts have been devoted to image captioning, i.e., describing images …

A comprehensive survey of deep learning for image captioning

MDZ Hossain, F Sohel, MF Shiratuddin… - ACM Computing Surveys …, 2019 - dl.acm.org
Generating a description of an image is called image captioning. Image captioning requires
recognizing the important objects, their attributes, and their relationships in an image. It also …

Survey of the state of the art in natural language generation: Core tasks, applications and evaluation

A Gatt, E Krahmer - Journal of Artificial Intelligence Research, 2018 - jair.org
This paper surveys the current state of the art in Natural Language Generation (NLG),
defined as the task of generating text or speech from non-linguistic input. A survey of NLG is …

Show and tell: Lessons learned from the 2015 MSCOCO image captioning challenge

O Vinyals, A Toshev, S Bengio… - IEEE transactions on …, 2016 - ieeexplore.ieee.org
Automatically describing the content of an image is a fundamental problem in artificial
intelligence that connects computer vision and natural language processing. In this paper …

Microsoft COCO captions: Data collection and evaluation server

X Chen, H Fang, TY Lin, R Vedantam, S Gupta… - arXiv preprint arXiv …, 2015 - arxiv.org
In this paper we describe the Microsoft COCO Caption dataset and evaluation server. When
completed, the dataset will contain over one and a half million captions describing over …

Show, attend and tell: Neural image caption generation with visual attention

K Xu, J Ba, R Kiros, K Cho, A Courville… - International …, 2015 - proceedings.mlr.press
Inspired by recent work in machine translation and object detection, we introduce an
attention-based model that automatically learns to describe the content of images. We …

Deep visual-semantic alignments for generating image descriptions

A Karpathy, L Fei-Fei - Proceedings of the IEEE conference on …, 2015 - cv-foundation.org
We present a model that generates natural language descriptions of images and their
regions. Our approach leverages datasets of images and their sentence descriptions to …

Long-term recurrent convolutional networks for visual recognition and description

J Donahue, L Anne Hendricks… - Proceedings of the …, 2015 - openaccess.thecvf.com
Models comprised of deep convolutional network layers have dominated recent
image interpretation tasks; we investigate whether models which are also compositional, or" …

Show and tell: A neural image caption generator

O Vinyals, A Toshev, S Bengio… - Proceedings of the IEEE …, 2015 - cv-foundation.org
Automatically describing the content of an image is a fundamental problem in artificial
intelligence that connects computer vision and natural language processing. In this paper …

Unifying visual-semantic embeddings with multimodal neural language models

R Kiros, R Salakhutdinov, RS Zemel - arXiv preprint arXiv:1411.2539, 2014 - arxiv.org
Inspired by recent advances in multimodal learning and machine translation, we introduce
an encoder-decoder pipeline that learns (a): a multimodal joint embedding space with …