A comprehensive survey of deep learning for image captioning

MDZ Hossain, F Sohel, MF Shiratuddin… - ACM Computing Surveys …, 2019 - dl.acm.org
Generating a description of an image is called image captioning. Image captioning requires
recognizing the important objects, their attributes, and their relationships in an image. It also …

Deep learning for image-to-text generation: A technical overview

X He, L Deng - IEEE Signal Processing Magazine, 2017 - ieeexplore.ieee.org
Generating a natural language description from an image is an emerging interdisciplinary
problem at the intersection of computer vision, natural language processing, and artificial …

GQA: A new dataset for real-world visual reasoning and compositional question answering

DA Hudson, CD Manning - … of the IEEE/CVF conference on …, 2019 - openaccess.thecvf.com
We introduce GQA, a new dataset for real-world visual reasoning and compositional
question answering, seeking to address key shortcomings of previous VQA datasets. We …

Attention gated networks: Learning to leverage salient regions in medical images

J Schlemper, O Oktay, M Schaap, M Heinrich… - Medical image …, 2019 - Elsevier
We propose a novel attention gate (AG) model for medical image analysis that automatically
learns to focus on target structures of varying shapes and sizes. Models trained with AGs …

Attention U-Net: Learning where to look for the pancreas

O Oktay, J Schlemper, LL Folgoc, M Lee… - arXiv preprint arXiv …, 2018 - arxiv.org
We propose a novel attention gate (AG) model for medical imaging that automatically learns
to focus on target structures of varying shapes and sizes. Models trained with AGs implicitly …

Bottom-up abstractive summarization

S Gehrmann, Y Deng, AM Rush - arXiv preprint arXiv:1808.10792, 2018 - arxiv.org
Neural network-based methods for abstractive summarization produce outputs that are more
fluent than other techniques, but which can be poor at content selection. This work proposes …

FiLM: Visual reasoning with a general conditioning layer

E Perez, F Strub, H De Vries, V Dumoulin… - Proceedings of the …, 2018 - ojs.aaai.org
We introduce a general-purpose conditioning method for neural networks called FiLM:
Feature-wise Linear Modulation. FiLM layers influence neural network computation via a …

CAMP: Cross-modal adaptive message passing for text-image retrieval

Z Wang, X Liu, H Li, L Sheng, J Yan… - Proceedings of the …, 2019 - openaccess.thecvf.com
Text-image cross-modal retrieval is a challenging task in the field of language and vision.
Most previous approaches independently embed images and sentences into a joint …

Learning by abstraction: The neural state machine

D Hudson, CD Manning - Advances in neural information …, 2019 - proceedings.neurips.cc
We introduce the Neural State Machine, seeking to bridge the gap between the
neural and symbolic views of AI and integrate their complementary strengths for the task of …

ReCLIP: A strong zero-shot baseline for referring expression comprehension

S Subramanian, W Merrill, T Darrell, M Gardner… - arXiv preprint arXiv …, 2022 - arxiv.org
Training a referring expression comprehension (ReC) model for a new visual domain
requires collecting referring expressions, and potentially corresponding bounding boxes, for …