Deep vision multimodal learning: Methodology, benchmark, and trend

W Chai, G Wang - Applied Sciences, 2022 - mdpi.com
Deep vision multimodal learning aims at combining deep visual representation learning with
other modalities, such as text, sound, and data collected from other sensors. With the fast …

A thorough review of models, evaluation metrics, and datasets on image captioning

G Luo, L Cheng, C Jing, C Zhao… - IET Image Processing, 2022 - Wiley Online Library
Image captioning means generate descriptive sentences from a query image automatically.
It has recently received widespread attention from the computer vision and natural language …

Exploring group video captioning with efficient relational approximation

W Lin, T Jin, Y Wang, W Pan, L Li… - Proceedings of the …, 2023 - openaccess.thecvf.com
Current video captioning efforts most focus on describing a single video while the need for
captioning videos in groups has increased considerably. In this study, we propose a new …

Rethinking the reference-based distinctive image captioning

Y Mao, L Chen, Z Jiang, D Zhang, Z Zhang… - Proceedings of the 30th …, 2022 - dl.acm.org
Distinctive Image Captioning (DIC)---generating distinctive captions that describe the unique
details of a target image---has received considerable attention over the last few years. A …

Progressive tree-structured prototype network for end-to-end image captioning

P Zeng, J Zhu, J Song, L Gao - … of the 30th ACM International Conference …, 2022 - dl.acm.org
Studies of image captioning are shifting towards a trend of a fully end-to-end paradigm by
leveraging powerful visual pre-trained models and transformer-based generation …

Switching to discriminative image captioning by relieving a bottleneck of reinforcement learning

U Honda, T Watanabe… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Discriminativeness is a desirable feature of image captions: captions should describe the
characteristic details of input images. However, recent high-performing captioning models …

Improving reference-based distinctive image captioning with contrastive rewards

Y Mao, J Xiao, D Zhang, M Cao, J Shao… - ACM Transactions on …, 2024 - dl.acm.org
Distinctive Image Captioning (DIC)—generating distinctive captions that describe the unique
details of a target image—has received considerable attention over the last few years. A …

Distinctive image captioning via clip guided group optimization

Y Zhang, J Wang, H Wu, W Xu - European Conference on Computer …, 2022 - Springer
Image captioning models are usually trained according to human annotated ground-truth
captions, which could generate accurate but generic captions. In this paper, we focus on …

Learning descriptive image captioning via semipermeable maximum likelihood estimation

Z Yue, A Hu, L Zhang, Q Jin - Advances in Neural …, 2023 - proceedings.neurips.cc
Image captioning aims to describe visual content in natural language. As 'a picture is worth a
thousand words', there could be various correct descriptions for an image. However, with …

Pragmatic inference with a CLIP listener for contrastive captioning

J Ou, B Krojer, D Fried - arXiv preprint arXiv:2306.08818, 2023 - arxiv.org
We propose a simple yet effective and robust method for contrastive captioning: generating
discriminative captions that distinguish target images from very similar alternative distractor …