A comprehensive survey of deep learning for image captioning
Generating a description of an image is called image captioning. Image captioning requires
recognizing the important objects, their attributes, and their relationships in an image. It also …
Deep learning for image-to-text generation: A technical overview
Generating a natural language description from an image is an emerging interdisciplinary
problem at the intersection of computer vision, natural language processing, and artificial …
GQA: A new dataset for real-world visual reasoning and compositional question answering
We introduce GQA, a new dataset for real-world visual reasoning and compositional
question answering, seeking to address key shortcomings of previous VQA datasets. We …
Attention gated networks: Learning to leverage salient regions in medical images
We propose a novel attention gate (AG) model for medical image analysis that automatically
learns to focus on target structures of varying shapes and sizes. Models trained with AGs …
Attention U-Net: Learning where to look for the pancreas
We propose a novel attention gate (AG) model for medical imaging that automatically learns
to focus on target structures of varying shapes and sizes. Models trained with AGs implicitly …
Bottom-up abstractive summarization
Neural network-based methods for abstractive summarization produce outputs that are more
fluent than other techniques, but which can be poor at content selection. This work proposes …
FiLM: Visual reasoning with a general conditioning layer
We introduce a general-purpose conditioning method for neural networks called FiLM:
Feature-wise Linear Modulation. FiLM layers influence neural network computation via a …
CAMP: Cross-modal adaptive message passing for text-image retrieval
Text-image cross-modal retrieval is a challenging task in the field of language and vision.
Most previous approaches independently embed images and sentences into a joint …
Learning by abstraction: The neural state machine
We introduce the Neural State Machine, seeking to bridge the gap between the
neural and symbolic views of AI and integrate their complementary strengths for the task of …
ReCLIP: A strong zero-shot baseline for referring expression comprehension
Training a referring expression comprehension (ReC) model for a new visual domain
requires collecting referring expressions, and potentially corresponding bounding boxes, for …