Cross-modal retrieval: a systematic review of methods and future directions

T Wang, F Li, L Zhu, J Li, Z Zhang… - Proceedings of the …, 2025 - ieeexplore.ieee.org
With the exponential surge in diverse multimodal data, traditional unimodal retrieval
methods struggle to meet the needs of users seeking access to data across various …

Linguistic-Aware Patch Slimming Framework for Fine-grained Cross-Modal Alignment

Z Fu, L Zhang, H Xia, Z Mao - Proceedings of the IEEE/CVF …, 2024 - openaccess.thecvf.com
Cross-modal alignment aims to build a bridge connecting vision and language. It is an
important multi-modal task that efficiently learns the semantic similarities between images …

Adapting vision-language models via learning to inject knowledge

S Xuan, M Yang, S Zhang - IEEE Transactions on Image …, 2024 - ieeexplore.ieee.org
Pre-trained vision-language models (VLMs), such as CLIP, have demonstrated impressive
zero-shot performance on various vision tasks. Trained on millions or even billions of image …

LPCR-IoT: Lightweight and privacy-preserving cross-modal Retrieval in IoT

M Li, Y Zhu, R Du, C Jia - IEEE Internet of Things Journal, 2025 - ieeexplore.ieee.org
As a pivotal link between visual and linguistic relationships, image-text cross-modal retrieval
has received widespread attention. However, existing studies primarily focus on intricate …

Semantic-Aware Representation of Multi-Modal Data for Data Ingress: A Literature Review

P Lamart, Y Yu, C Berger - 2024 50th Euromicro Conference on …, 2024 - ieeexplore.ieee.org
Machine Learning (ML) is continuously permeating a growing amount of application
domains. Generative AI such as Large Language Models (LLMs) also sees broad adoption …

Knowledge Graph Enhanced Multimodal Transformer for Image-Text Retrieval

J Zheng, M Liang, Y Yu, Y Li… - 2024 IEEE 40th …, 2024 - ieeexplore.ieee.org
Image-text retrieval is a fundamental cross-modal task that aims to align the representation
spaces between the image and text modalities. Existing cross-modal image-text retrieval …

Perceive, Reason, and Align: Context-guided cross-modal correlation learning for image–text retrieval

Z Liu, X Pei, S Gao, C Li, J Wang, J Xu - Applied Soft Computing, 2024 - Elsevier
Due to the inconsistency in feature representations between different modalities, namely
“Heterogeneous gap”, it remains a persistent challenge to correlate images and texts …

Masked Contrastive Reconstruction for Cross-modal Medical Image-Report Retrieval

Z Wei, K Jin, X Zhou - arXiv preprint arXiv:2312.15840, 2023 - arxiv.org
Cross-modal medical image-report retrieval task plays a significant role in clinical diagnosis
and various medical generative tasks. Eliminating heterogeneity between different …

Ensemble Prototype Networks for Unsupervised Cross-modal Hashing with Cross-Task Consistency

X Liu, H Zeng, Y Shi, J Zhu, K Yang… - IEEE Transactions on …, 2025 - ieeexplore.ieee.org
In the swiftly advancing realm of information retrieval, unsupervised cross-modal hashing
has emerged as a focal point of research, taking advantage of the inherent advantages of …

SIRS: Multi-task Joint Learning for Remote Sensing Foreground-entity Image-text Retrieval

Z Zhu, J Kang, W Diao, Y Feng, J Li… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
The essence of improving the effect of cross-modal image–text retrieval (CIR) lies in the
finer-grained modeling of homogeneous features between modalities. However, in remote …