Vision-language pre-training: Basics, recent advances, and future trends
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …
From show to tell: A survey on deep learning-based image captioning
Connecting Vision and Language plays an essential role in Generative Intelligence. For this
reason, large research efforts have been devoted to image captioning, i.e., describing images …
ClipCap: CLIP prefix for image captioning
Image captioning is a fundamental task in vision-language understanding, where the model
predicts a textual informative caption to a given input image. In this paper, we present a …
Scaling open-vocabulary image segmentation with image-level labels
We design an open-vocabulary image segmentation model to organize an image into
meaningful regions indicated by arbitrary texts. Recent works (CLIP and ALIGN), despite …
Large language models: A survey
Large Language Models (LLMs) have drawn a lot of attention due to their strong
performance on a wide range of natural language tasks, since the release of ChatGPT in …
Making the most of text semantics to improve biomedical vision–language processing
Multi-modal data abounds in biomedicine, such as radiology images and reports.
Interpreting this data at scale is essential for improving clinical care and accelerating clinical …
VinVL: Revisiting visual representations in vision-language models
This paper presents a detailed study of improving vision features and develops an improved
object detection model for vision language (VL) tasks. Compared to the most widely used …
BLEURT: Learning robust metrics for text generation
Text generation has made significant advances in the last few years. Yet, evaluation metrics
have lagged behind, as the most popular choices (e.g., BLEU and ROUGE) may correlate …
Suppress and balance: A simple gated network for salient object detection
Most salient object detection approaches use U-Net or feature pyramid networks (FPN) as
their basic structures. These methods ignore two key problems when the encoder …
Attention on attention for image captioning
Attention mechanisms are widely used in current encoder/decoder frameworks of image
captioning, where a weighted average on encoded vectors is generated at each time step to …