Large-scale multi-modal pre-trained models: A comprehensive survey

X Wang, G Chen, G Qian, P Gao, XY Wei… - Machine Intelligence …, 2023 - Springer
With the urgent demand for generalized deep models, many large pre-trained models have been proposed, such as bidirectional encoder representations from transformers (BERT), vision transformer (ViT) …

Vlp: A survey on vision-language pre-training

FL Chen, DZ Zhang, ML Han, XY Chen, J Shi… - Machine Intelligence …, 2023 - Springer
In the past few years, the emergence of pre-training models has brought uni-modal fields
such as computer vision (CV) and natural language processing (NLP) to a new era …

Dynamic modality interaction modeling for image-text retrieval

L Qu, M Liu, J Wu, Z Gao, L Nie - … of the 44th International ACM SIGIR …, 2021 - dl.acm.org
Image-text retrieval is a fundamental and crucial branch in information retrieval. Although
much progress has been made in bridging vision and language, it remains challenging …

Kaleido-bert: Vision-language pre-training on fashion domain

M Zhuge, D Gao, DP Fan… - Proceedings of the …, 2021 - openaccess.thecvf.com
We present a new vision-language (VL) pre-training model dubbed Kaleido-BERT, which
introduces a novel kaleido strategy for fashion cross-modality representations from …

Image-text retrieval: A survey on recent research and development

M Cao, S Li, J Li, L Nie, M Zhang - arXiv preprint arXiv:2203.14713, 2022 - arxiv.org
In the past few years, cross-modal image-text retrieval (ITR) has experienced increased
interest in the research community due to its excellent research value and broad real-world …

M6: A Chinese multimodal pretrainer

J Lin, R Men, A Yang, C Zhou, M Ding, Y Zhang… - arXiv preprint arXiv …, 2021 - arxiv.org
In this work, we construct the largest dataset for multimodal pretraining in Chinese, which
consists of over 1.9 TB of images and 292 GB of texts that cover a wide range of domains. We …

Ei-clip: Entity-aware interventional contrastive learning for e-commerce cross-modal retrieval

H Ma, H Zhao, Z Lin, A Kale, Z Wang… - Proceedings of the …, 2022 - openaccess.thecvf.com
… recommendation, and marketing services. Extensive efforts have been made to
conquer the cross-modal retrieval problem in the general domain. When it comes to e-commerce …

Fame-vil: Multi-tasking vision-language model for heterogeneous fashion tasks

X Han, X Zhu, L Yu, L Zhang… - Proceedings of the …, 2023 - openaccess.thecvf.com
In the fashion domain, there exists a variety of vision-and-language (V+L) tasks, including
cross-modal retrieval, text-guided image retrieval, multi-modal classification, and image …

Cross-modal retrieval: a systematic review of methods and future directions

T Wang, F Li, L Zhu, J Li, Z Zhang… - Proceedings of the …, 2025 - ieeexplore.ieee.org
With the exponential surge in diverse multimodal data, traditional unimodal retrieval
methods struggle to meet the needs of users seeking access to data across various …

Vision-and-language pretrained models: A survey

S Long, F Cao, SC Han, H Yang - arXiv preprint arXiv:2204.07356, 2022 - arxiv.org
Pretrained models have achieved great success in both Computer Vision (CV) and Natural
Language Processing (NLP). This progress has led to learning joint representations of vision …