Large-scale multi-modal pre-trained models: A comprehensive survey

X Wang, G Chen, G Qian, P Gao, XY Wei… - Machine Intelligence …, 2023 - Springer
With the urgent demand for generalized deep models, many pre-trained big models are
proposed, such as bidirectional encoder representations (BERT), vision transformer (ViT) …

Cross-modal implicit relation reasoning and aligning for text-to-image person retrieval

D Jiang, M Ye - Proceedings of the IEEE/CVF Conference …, 2023 - openaccess.thecvf.com
Text-to-image person retrieval aims to identify the target person based on a given textual
description query. The primary challenge is to learn the mapping of visual and textual …

Open-vocabulary DETR with conditional matching

Y Zang, W Li, K Zhou, C Huang, CC Loy - European Conference on …, 2022 - Springer
Open-vocabulary object detection, which is concerned with the problem of detecting novel
objects guided by natural language, has gained increasing attention from the community …