Image-text retrieval: A survey on recent research and development

M Cao, S Li, J Li, L Nie, M Zhang - arXiv preprint arXiv:2203.14713, 2022 - arxiv.org
In the past few years, cross-modal image-text retrieval (ITR) has experienced increased
interest in the research community due to its excellent research value and broad real-world …

Cross-modal retrieval: a systematic review of methods and future directions

T Wang, F Li, L Zhu, J Li, Z Zhang… - Proceedings of the …, 2025 - ieeexplore.ieee.org
With the exponential surge in diverse multimodal data, traditional unimodal retrieval
methods struggle to meet the needs of users seeking access to data across various …

FashionVLP: Vision language transformer for fashion retrieval with feedback

S Goenka, Z Zheng, A Jaiswal… - Proceedings of the …, 2022 - openaccess.thecvf.com
Fashion image retrieval based on a query pair of reference image and natural language
feedback is a challenging task that requires models to assess fashion related information …

ViSTA: Vision and scene text aggregation for cross-modal retrieval

M Cheng, Y Sun, L Wang, X Zhu… - Proceedings of the …, 2022 - openaccess.thecvf.com
Visual appearance is considered to be the most important cue to understand images for
cross-modal retrieval, while sometimes the scene text appearing in images can provide …

A large cross-modal video retrieval dataset with reading comprehension

W Wu, Y Zhao, Z Li, J Li, H Zhou, MZ Shou, X Bai - Pattern Recognition, 2025 - Elsevier
Most existing cross-modal language-to-video retrieval (VR) research focuses on single-modal
input from video, i.e., visual representation, while the text is omnipresent in human …

MMpedia: A large-scale multi-modal knowledge graph

Y Wu, X Wu, J Li, Y Zhang, H Wang, W Du, Z He… - International semantic …, 2023 - Springer
Knowledge graphs serve as crucial resources for various applications. However,
most existing knowledge graphs present symbolic knowledge in the form of natural …

OCR-IDL: OCR annotations for industry document library dataset

AF Biten, R Tito, L Gomez, E Valveny… - European Conference on …, 2022 - Springer
Pretraining has proven successful in Document Intelligence tasks where deluge of
documents are used to pretrain the models only later to be finetuned on downstream tasks …

Is an image worth five sentences? A new look into semantics for image-text matching

AF Biten, A Mafla, L Gómez… - Proceedings of the …, 2022 - openaccess.thecvf.com
The task of image-text matching aims to map representations from different modalities into a
common joint visual-textual embedding. However, the most widely used datasets for this …

BCRA: bidirectional cross-modal implicit relation reasoning and aligning for text-to-image person retrieval

Z Li, Y Xie - Multimedia Systems, 2024 - Springer
Text-to-image person retrieval aims to retrieve relevant target individuals based on given
textual descriptions. The main challenge faced by this task is how to better combine and …

Adaptive transformer-based conditioned variational autoencoder for incomplete social event classification

Z Li, S Qian, J Cao, Q Fang, C Xu - Proceedings of the 30th ACM …, 2022 - dl.acm.org
With the rapid development of the Internet and the expanding scale of social media,
incomplete social event classification has increasingly become a challenging task. The key …