Cross-modal retrieval: a systematic review of methods and future directions

T Wang, F Li, L Zhu, J Li, Z Zhang… - Proceedings of the …, 2025 - ieeexplore.ieee.org
With the exponential surge in diverse multimodal data, traditional unimodal retrieval
methods struggle to meet the needs of users seeking access to data across various …

Video-text as game players: Hierarchical banzhaf interaction for cross-modal representation learning

P Jin, J Huang, P Xiong, S Tian, C Liu… - Proceedings of the …, 2023 - openaccess.thecvf.com
Contrastive learning-based video-language representation learning approaches, e.g., CLIP,
have achieved outstanding performance, which pursue semantic interaction upon pre …

Cross modal retrieval with querybank normalisation

SV Bogolin, I Croitoru, H Jin, Y Liu… - Proceedings of the …, 2022 - openaccess.thecvf.com
Profiting from large-scale training datasets, advances in neural architecture design and
efficient inference, joint embeddings have become the dominant approach for tackling cross …

Text-video retrieval with disentangled conceptualization and set-to-set alignment

P Jin, H Li, Z Cheng, J Huang, Z Wang, L Yuan… - arXiv preprint arXiv …, 2023 - arxiv.org
Text-video retrieval is a challenging cross-modal task, which aims to align visual entities with
natural language descriptions. Current methods either fail to leverage the local details or are …

Dual alignment unsupervised domain adaptation for video-text retrieval

X Hao, W Zhang, D Wu, F Zhu… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Video-text retrieval is an emerging stream in both computer vision and natural language
processing communities, which aims to find relevant videos given text queries. In this paper …

OMGH: Online manifold-guided hashing for flexible cross-modal retrieval

X Liu, J Yi, Y Cheung, X Xu, Z Cui - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
Cross-modal hashing has recently gained an increasing attention for its efficiency and fast
retrieval speed in indexing the multimedia data across different modalities. Nevertheless, the …

Token Transformation Matters: Towards Faithful Post-hoc Explanation for Vision Transformer

J Wu, B Duan, W Kang, H Tang… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
While Transformers have rapidly gained popularity in various computer vision applications,
post-hoc explanations of their internal mechanisms remain largely unexplored. Vision …

Prototype-guided knowledge transfer for federated unsupervised cross-modal hashing

J Li, F Li, L Zhu, H Cui, J Li - Proceedings of the 31st ACM International …, 2023 - dl.acm.org
Although deep cross-modal hashing methods have shown superiorities for cross-modal
retrieval recently, there is a concern about potential data privacy leakage when training the …

Noise is also useful: Negative correlation-steered latent contrastive learning

J Yan, L Luo, C Xu, C Deng… - Proceedings of the IEEE …, 2022 - openaccess.thecvf.com
How to effectively handle label noise has been one of the most practical but challenging
tasks in Deep Neural Networks (DNNs). Recent popular methods for training DNNs with …

Semantic collaborative learning for cross-modal moment localization

Y Hu, K Wang, M Liu, H Tang, L Nie - ACM Transactions on Information …, 2023 - dl.acm.org
Localizing a desired moment within an untrimmed video via a given natural language query,
i.e., cross-modal moment localization, has attracted widespread research attention recently …