From show to tell: A survey on deep learning-based image captioning

M Stefanini, M Cornia, L Baraldi… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
Connecting Vision and Language plays an essential role in Generative Intelligence. For this
reason, large research efforts have been devoted to image captioning, i.e., describing images …

Visuals to text: A comprehensive review on automatic image captioning

Y Ming, N Hu, C Fan, F Feng… - IEEE/CAA Journal of …, 2022 - researchportal.port.ac.uk
Image captioning refers to the automatic generation of descriptive texts according to the visual
content of images. It is a technique integrating multiple disciplines including the computer …

Positive-augmented contrastive learning for image and video captioning evaluation

S Sarto, M Barraco, M Cornia… - Proceedings of the …, 2023 - openaccess.thecvf.com
The CLIP model has been recently proven to be very effective for a variety of cross-modal
tasks, including the evaluation of captions generated from vision-and-language …

Injecting semantic concepts into end-to-end image captioning

Z Fang, J Wang, X Hu, L Liang, Z Gan… - Proceedings of the …, 2022 - openaccess.thecvf.com
Tremendous progress has been made in recent years in developing better image captioning
models, yet most of them rely on a separate object detector to extract regional features …

BRIDGE: Bridging Gaps in Image Captioning Evaluation with Stronger Visual Cues

S Sarto, M Cornia, L Baraldi, R Cucchiara - European Conference on …, 2024 - Springer
Effectively aligning with human judgment when evaluating machine-generated image
captions represents a complex yet intriguing challenge. Existing evaluation metrics like …

Emscore: Evaluating video captioning via coarse-grained and fine-grained embedding matching

Y Shi, X Yang, H Xu, C Yuan, B Li… - Proceedings of the …, 2022 - openaccess.thecvf.com
Current metrics for video captioning are mostly based on the text-level comparison between
reference and candidate captions. However, they have some insuperable drawbacks, e.g., …

Improving image captioning descriptiveness by ranking and LLM-based fusion

S Bianco, L Celona, M Donzella… - arXiv preprint arXiv …, 2023 - arxiv.org
State-of-the-Art (SoTA) image captioning models often rely on the Microsoft COCO (MS-
COCO) dataset for training. This dataset contains annotations provided by human …

InfoMetIC: an informative metric for reference-free image caption evaluation

A Hu, S Chen, L Zhang, Q Jin - arXiv preprint arXiv:2305.06002, 2023 - arxiv.org
Automatic image captioning evaluation is critical for benchmarking and promoting advances
in image captioning research. Existing metrics only provide a single score to measure …

Positive-Augmented Contrastive Learning for Vision-and-Language Evaluation and Training

S Sarto, N Moratelli, M Cornia, L Baraldi… - arXiv preprint arXiv …, 2024 - arxiv.org
Despite significant advancements in caption generation, existing evaluation metrics often fail
to capture the full quality or fine-grained details of captions. This is mainly due to their …

Deep learning approaches for image captioning: Opportunities, challenges and future potential

A Jamil, K Mahmood, MG Villar, T Prola… - IEEE …, 2024 - ieeexplore.ieee.org
Generative intelligence relies heavily on the integration of vision and language. Much of the
research has focused on image captioning, which involves describing images with …