From show to tell: A survey on deep learning-based image captioning
Connecting Vision and Language plays an essential role in Generative Intelligence. For this
reason, large research efforts have been devoted to image captioning, i.e., describing images …
A comprehensive survey of deep learning for image captioning
Generating a description of an image is called image captioning. Image captioning requires
recognizing the important objects, their attributes, and their relationships in an image. It also …
Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning
We present a new dataset of image caption annotations, Conceptual Captions, which
contains an order of magnitude more images than the MS-COCO dataset (Lin et al., 2014) …
Deep multimodal representation learning: A survey
W Guo, J Wang, S Wang - IEEE Access, 2019 - ieeexplore.ieee.org
Multimodal representation learning, which aims to narrow the heterogeneity gap among
different modalities, plays an indispensable role in the utilization of ubiquitous multimodal …
Multimodal machine learning: A survey and taxonomy
Our experience of the world is multimodal: we see objects, hear sounds, feel texture, smell
odors, and taste flavors. Modality refers to the way in which something happens or is …
Survey of the state of the art in natural language generation: Core tasks, applications and evaluation
This paper surveys the current state of the art in Natural Language Generation (NLG),
defined as the task of generating text or speech from non-linguistic input. A survey of NLG is …
SPICE: Semantic propositional image caption evaluation
There is considerable interest in the task of automatically generating image captions.
However, evaluation is challenging. Existing automatic evaluation metrics are primarily …
Remind your neural network to prevent catastrophic forgetting
People learn throughout life. However, incrementally updating conventional neural networks
leads to catastrophic forgetting. A common remedy is replay, which is inspired by how the …
What makes training multi-modal classification networks hard?
Consider end-to-end training of a multi-modal vs. a uni-modal network on a task with
multiple input modalities: the multi-modal network receives more information, so it should …
Visual translation embedding network for visual relation detection
Visual relations, such as "person ride bike" and "bike next to car", offer a comprehensive
scene understanding of an image, and have already shown their great utility in connecting …