From show to tell: A survey on deep learning-based image captioning
Connecting Vision and Language plays an essential role in Generative Intelligence. For this
reason, large research efforts have been devoted to image captioning, ie describing images …
reason, large research efforts have been devoted to image captioning, ie describing images …
A comprehensive survey of deep learning for image captioning
Generating a description of an image is called image captioning. Image captioning requires
recognizing the important objects, their attributes, and their relationships in an image. It also …
recognizing the important objects, their attributes, and their relationships in an image. It also …
Reinforced cross-modal matching and self-supervised imitation learning for vision-language navigation
Vision-language navigation (VLN) is the task of navigating an embodied agent to carry out
natural language instructions inside real 3D environments. In this paper, we study how to …
natural language instructions inside real 3D environments. In this paper, we study how to …
Multimodal intelligence: Representation learning, information fusion, and applications
Deep learning methods haverevolutionized speech recognition, image recognition, and
natural language processing since 2010. Each of these tasks involves a single modality in …
natural language processing since 2010. Each of these tasks involves a single modality in …
Neural motifs: Scene graph parsing with global context
We investigate the problem of producing structured graph representations of visual scenes.
Our work analyzes the role of motifs: regularly appearing substructures in scene graphs. We …
Our work analyzes the role of motifs: regularly appearing substructures in scene graphs. We …
Making the v in vqa matter: Elevating the role of image understanding in visual question answering
Problems at the intersection of vision and language are of significant importance both as
challenging research questions and for the rich set of applications they enable. However …
challenging research questions and for the rich set of applications they enable. However …
Deep learning in medical imaging: general overview
The artificial neural network (ANN)–a machine learning technique inspired by the human
neuronal synapse system–was introduced in the 1950s. However, the ANN was previously …
neuronal synapse system–was introduced in the 1950s. However, the ANN was previously …
Knowing when to look: Adaptive attention via a visual sentinel for image captioning
Attention-based neural encoder-decoder frameworks have been widely adopted for image
captioning. Most methods force visual attention to be active for every generated word …
captioning. Most methods force visual attention to be active for every generated word …
Neural baby talk
We introduce a novel framework for image captioning that can produce natural language
explicitly grounded in entities that object detectors find in the image. Our approach …
explicitly grounded in entities that object detectors find in the image. Our approach …
Online multi-object tracking with dual matching attention networks
In this paper, we propose an online Multi-Object Tracking (MOT) approach which integrates
the merits of single object tracking and data association methods in a unified framework to …
the merits of single object tracking and data association methods in a unified framework to …