Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

Vinvl: Revisiting visual representations in vision-language models

P Zhang, X Li, X Hu, J Yang, L Zhang… - Proceedings of the …, 2021 - openaccess.thecvf.com
This paper presents a detailed study of improving vision features and develops an improved
object detection model for vision language (VL) tasks. Compared to the most widely used …

Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks

X Li, X Yin, C Li, P Zhang, X Hu, L Zhang… - Computer Vision–ECCV …, 2020 - Springer
Large-scale pre-training methods of learning cross-modal representations on image-text
pairs are becoming popular for vision-language tasks. While existing methods simply …

Similarity reasoning and filtration for image-text matching

H Diao, Y Zhang, L Ma, H Lu - Proceedings of the AAAI conference on …, 2021 - ojs.aaai.org
Image-text matching plays a critical role in bridging the vision and language, and great
progress has been made by exploiting the global alignment between image and sentence …

Multi-modal knowledge graph construction and application: A survey

X Zhu, Z Li, X Wang, X Jiang, P Sun… - … on Knowledge and …, 2022 - ieeexplore.ieee.org
Recent years have witnessed the resurgence of knowledge engineering which is featured
by the fast growth of knowledge graphs. However, most of existing knowledge graphs are …

Clip-driven fine-grained text-image person re-identification

S Yan, N Dong, L Zhang, J Tang - IEEE Transactions on Image …, 2023 - ieeexplore.ieee.org
Text-Image Person Re-identification (TIReID) aims to retrieve the image corresponding to
the given text query from a pool of candidate images. Existing methods employ prior …

Imram: Iterative matching with recurrent attention memory for cross-modal image-text retrieval

H Chen, G Ding, X Liu, Z Lin, J Liu… - Proceedings of the …, 2020 - openaccess.thecvf.com
Enabling bi-directional retrieval of images and texts is important for understanding the
correspondence between vision and language. Existing methods leverage the attention …

Stacked cross attention for image-text matching

KH Lee, X Chen, G Hua, H Hu… - Proceedings of the …, 2018 - openaccess.thecvf.com
In this paper, we study the problem of image-text matching. Inferring the latent semantic
alignment between objects or other salient stuff (eg snow, sky, lawn) and the corresponding …