From show to tell: A survey on deep learning-based image captioning
Connecting Vision and Language plays an essential role in Generative Intelligence. For this
reason, large research efforts have been devoted to image captioning, i.e., describing images …
Multimodal research in vision and language: A review of current and emerging trends
Deep Learning and its applications have cascaded impactful research and development
with a diverse range of modalities present in the real-world data. More recently, this has …
Contrastive attention for automatic chest X-ray report generation
Recently, chest X-ray report generation, which aims to automatically generate descriptions
of given chest X-ray images, has received growing research interests. The key challenge of …
Syntax-aware action targeting for video captioning
Existing methods on video captioning have made great efforts to identify objects/instances in
videos, but few of them emphasize the prediction of action. As a result, the learned models …
Rumor detection on social media with graph adversarial contrastive learning
Rumors spread through the Internet, especially on Twitter, have harmed social stability and
residents' daily lives. Recently, in addition to utilizing the text features of posts for rumor …
Fine-grained image captioning with CLIP reward
Modern image captioning models are usually trained with text similarity objectives. However,
since reference captions in public datasets often describe the most salient common objects …
Neural sign language translation based on human keypoint estimation
We propose a sign language translation system based on human keypoint estimation. It is
well-known that many problems in the field of computer vision require a massive dataset to …
ContraBERT: Enhancing code pre-trained models via contrastive learning
Large-scale pre-trained models such as CodeBERT, GraphCodeBERT have earned
widespread attention from both academia and industry. Attributed to the superior ability in …
Discriminability objective for training descriptive captions
One property that remains lacking in image captions generated by contemporary methods is
discriminability: being able to tell two images apart given the caption for one of them. We …
Trends in integration of vision and language research: A survey of tasks, datasets, and methods
Interest in Artificial Intelligence (AI) and its applications has seen unprecedented
growth in the last few years. This success can be partly attributed to the advancements made …