Vision-language pre-training: Basics, recent advances, and future trends
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …
VinVL: Revisiting visual representations in vision-language models
This paper presents a detailed study of improving vision features and develops an improved
object detection model for vision language (VL) tasks. Compared to the most widely used …
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks
Large-scale pre-training methods of learning cross-modal representations on image-text
pairs are becoming popular for vision-language tasks. While existing methods simply …
Similarity reasoning and filtration for image-text matching
Image-text matching plays a critical role in bridging the vision and language, and great
progress has been made by exploiting the global alignment between image and sentence …
Multi-modal knowledge graph construction and application: A survey
Recent years have witnessed the resurgence of knowledge engineering which is featured
by the fast growth of knowledge graphs. However, most of existing knowledge graphs are …
CLIP-driven fine-grained text-image person re-identification
Text-Image Person Re-identification (TIReID) aims to retrieve the image corresponding to
the given text query from a pool of candidate images. Existing methods employ prior …
IMRAM: Iterative matching with recurrent attention memory for cross-modal image-text retrieval
Enabling bi-directional retrieval of images and texts is important for understanding the
correspondence between vision and language. Existing methods leverage the attention …
Stacked cross attention for image-text matching
In this paper, we study the problem of image-text matching. Inferring the latent semantic
alignment between objects or other salient stuff (e.g., snow, sky, lawn) and the corresponding …