Deep image captioning: A review of methods, trends and future challenges

L Xu, Q Tang, J Lv, B Zheng, X Zeng, W Li - Neurocomputing, 2023 - Elsevier
Image captioning, also called report generation in medical field, aims to describe visual
content of images in human language, which requires to model semantic relationship …

Automatic image and video caption generation with deep learning: A concise review and algorithmic overlap

S Amirian, K Rasheed, TR Taha, HR Arabnia - IEEE Access, 2020 - ieeexplore.ieee.org
Methodologies that utilize Deep Learning offer great potential for applications that
automatically attempt to generate captions or descriptions about images and video frames …

How much can CLIP benefit vision-and-language tasks?

S Shen, LH Li, H Tan, M Bansal, A Rohrbach… - arxiv preprint arxiv …, 2021 - arxiv.org
Most existing Vision-and-Language (V&L) models rely on pre-trained visual encoders, using
a relatively small set of manually-annotated data (as compared to web-crawled data), to …

Auto-encoding scene graphs for image captioning

X Yang, K Tang, H Zhang, J Cai - Proceedings of the IEEE …, 2019 - openaccess.thecvf.com
We propose Scene Graph Auto-Encoder (SGAE) that incorporates the language
inductive bias into the encoder-decoder image captioning framework for more human-like …

SpatialRGPT: Grounded Spatial Reasoning in Vision Language Models

AC Cheng, H Yin, Y Fu, Q Guo, R Yang, J Kautz… - arxiv preprint arxiv …, 2024 - arxiv.org
Vision Language Models (VLMs) have demonstrated remarkable performance in 2D vision
and language tasks. However, their ability to reason about spatial arrangements remains …

Causal attention for vision-language tasks

X Yang, H Zhang, G Qi, J Cai - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com
We present a novel attention mechanism: Causal Attention (CATT), to remove the ever-
elusive confounding effect in existing attention-based vision-language models. This effect …

Object hallucination in image captioning

A Rohrbach, LA Hendricks, K Burns, T Darrell… - arxiv preprint arxiv …, 2018 - arxiv.org
Despite continuously improving performance, contemporary image captioning models are
prone to "hallucinating" objects that are not actually in a scene. One problem is that standard …

On hallucination and predictive uncertainty in conditional language generation

Y Xiao, WY Wang - arxiv preprint arxiv:2103.15025, 2021 - arxiv.org
Despite improvements in performances on different natural language generation tasks, deep
neural models are prone to hallucinating facts that are incorrect or nonexistent. Different …

FuseCap: Leveraging large language models for enriched fused image captions

N Rotstein, D Bensaid, S Brody… - Proceedings of the …, 2024 - openaccess.thecvf.com
The advent of vision-language pre-training techniques enhanced substantial progress in the
development of models for image captioning. However, these models frequently produce …

Understanding and evaluating racial biases in image captioning

D Zhao, A Wang… - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com
Image captioning is an important task for benchmarking visual reasoning and for enabling
accessibility for people with vision impairments. However, as in many machine learning …