Deep vision multimodal learning: Methodology, benchmark, and trend
Deep vision multimodal learning aims at combining deep visual representation learning with
other modalities, such as text, sound, and data collected from other sensors. With the fast …
other modalities, such as text, sound, and data collected from other sensors. With the fast …
A thorough review of models, evaluation metrics, and datasets on image captioning
G Luo, L Cheng, C **g, C Zhao… - IET Image Processing, 2022 - Wiley Online Library
Image captioning means generate descriptive sentences from a query image automatically.
It has recently received widespread attention from the computer vision and natural language …
It has recently received widespread attention from the computer vision and natural language …
Exploring group video captioning with efficient relational approximation
Current video captioning efforts most focus on describing a single video while the need for
captioning videos in groups has increased considerably. In this study, we propose a new …
captioning videos in groups has increased considerably. In this study, we propose a new …
Rethinking the reference-based distinctive image captioning
Distinctive Image Captioning (DIC)---generating distinctive captions that describe the unique
details of a target image---has received considerable attention over the last few years. A …
details of a target image---has received considerable attention over the last few years. A …
Progressive tree-structured prototype network for end-to-end image captioning
Studies of image captioning are shifting towards a trend of a fully end-to-end paradigm by
leveraging powerful visual pre-trained models and transformer-based generation …
leveraging powerful visual pre-trained models and transformer-based generation …
Switching to discriminative image captioning by relieving a bottleneck of reinforcement learning
Discriminativeness is a desirable feature of image captions: captions should describe the
characteristic details of input images. However, recent high-performing captioning models …
characteristic details of input images. However, recent high-performing captioning models …
Improving reference-based distinctive image captioning with contrastive rewards
Distinctive Image Captioning (DIC)—generating distinctive captions that describe the unique
details of a target image—has received considerable attention over the last few years. A …
details of a target image—has received considerable attention over the last few years. A …
Distinctive image captioning via clip guided group optimization
Image captioning models are usually trained according to human annotated ground-truth
captions, which could generate accurate but generic captions. In this paper, we focus on …
captions, which could generate accurate but generic captions. In this paper, we focus on …
Learning descriptive image captioning via semipermeable maximum likelihood estimation
Image captioning aims to describe visual content in natural language. As'a picture is worth a
thousand words', there could be various correct descriptions for an image. However, with …
thousand words', there could be various correct descriptions for an image. However, with …
Pragmatic inference with a CLIP listener for contrastive captioning
We propose a simple yet effective and robust method for contrastive captioning: generating
discriminative captions that distinguish target images from very similar alternative distractor …
discriminative captions that distinguish target images from very similar alternative distractor …