From show to tell: A survey on deep learning-based image captioning

M Stefanini, M Cornia, L Baraldi… - IEEE transactions on …, 2022 - ieeexplore.ieee.org
Connecting Vision and Language plays an essential role in Generative Intelligence. For this
reason, large research efforts have been devoted to image captioning, i.e. describing images …

A comprehensive survey of deep learning for image captioning

MDZ Hossain, F Sohel, MF Shiratuddin… - ACM Computing Surveys …, 2019 - dl.acm.org
Generating a description of an image is called image captioning. Image captioning requires
recognizing the important objects, their attributes, and their relationships in an image. It also …

Reinforced cross-modal matching and self-supervised imitation learning for vision-language navigation

X Wang, Q Huang, A Celikyilmaz… - Proceedings of the …, 2019 - openaccess.thecvf.com
Vision-language navigation (VLN) is the task of navigating an embodied agent to carry out
natural language instructions inside real 3D environments. In this paper, we study how to …

Multimodal intelligence: Representation learning, information fusion, and applications

C Zhang, Z Yang, X He, L Deng - IEEE Journal of Selected …, 2020 - ieeexplore.ieee.org
Deep learning methods have revolutionized speech recognition, image recognition, and
natural language processing since 2010. Each of these tasks involves a single modality in …

Neural motifs: Scene graph parsing with global context

R Zellers, M Yatskar, S Thomson… - Proceedings of the …, 2018 - openaccess.thecvf.com
We investigate the problem of producing structured graph representations of visual scenes.
Our work analyzes the role of motifs: regularly appearing substructures in scene graphs. We …

Making the V in VQA matter: Elevating the role of image understanding in visual question answering

Y Goyal, T Khot, D Summers-Stay… - Proceedings of the …, 2017 - openaccess.thecvf.com
Problems at the intersection of vision and language are of significant importance both as
challenging research questions and for the rich set of applications they enable. However …

Deep learning in medical imaging: general overview

JG Lee, S Jun, YW Cho, H Lee… - Korean journal of …, 2017 - synapse.koreamed.org
The artificial neural network (ANN)–a machine learning technique inspired by the human
neuronal synapse system–was introduced in the 1950s. However, the ANN was previously …

Knowing when to look: Adaptive attention via a visual sentinel for image captioning

J Lu, C Xiong, D Parikh… - Proceedings of the IEEE …, 2017 - openaccess.thecvf.com
Attention-based neural encoder-decoder frameworks have been widely adopted for image
captioning. Most methods force visual attention to be active for every generated word …

Neural baby talk

J Lu, J Yang, D Batra, D Parikh - Proceedings of the IEEE …, 2018 - openaccess.thecvf.com
We introduce a novel framework for image captioning that can produce natural language
explicitly grounded in entities that object detectors find in the image. Our approach …

Online multi-object tracking with dual matching attention networks

J Zhu, H Yang, N Liu, M Kim… - Proceedings of the …, 2018 - openaccess.thecvf.com
In this paper, we propose an online Multi-Object Tracking (MOT) approach which integrates
the merits of single object tracking and data association methods in a unified framework to …