Cross-modal retrieval: a systematic review of methods and future directions

T Wang, F Li, L Zhu, J Li, Z Zhang… - Proceedings of the …, 2025 - ieeexplore.ieee.org
With the exponential surge in diverse multimodal data, traditional unimodal retrieval
methods struggle to meet the needs of users seeking access to data across various …

A survey of audio-based music classification and annotation

Z Fu, G Lu, KM Ting, D Zhang - IEEE Transactions on …, 2010 - ieeexplore.ieee.org
Music information retrieval (MIR) is an emerging research area that receives growing
attention from both the research community and music industry. It addresses the problem of …

Use what you have: Video retrieval using representations from collaborative experts

Y Liu, S Albanie, A Nagrani, A Zisserman - arXiv preprint arXiv …, 2019 - arxiv.org
The rapid growth of video on the internet has made searching for video content using natural
language queries a significant challenge. Human-generated queries for video datasets in the …

Learning audio-video modalities from image captions

A Nagrani, PH Seo, B Seybold, A Hauth… - … on Computer Vision, 2022 - Springer
There has been a recent explosion of large-scale image-text datasets, as images with alt-
text captions can be easily obtained online. Obtaining large-scale, high quality data for video …

Audio retrieval with natural language queries: A benchmark study

AS Koepke, AM Oncescu, JF Henriques… - IEEE Transactions …, 2022 - ieeexplore.ieee.org
The objectives of this work are cross-modal text-audio and audio-text retrieval, in which the
goal is to retrieve the audio content from a pool of candidates that best matches a given …

DeepEar: robust smartphone audio sensing in unconstrained acoustic environments using deep learning

ND Lane, P Georgiev, L Qendro - … of the 2015 ACM international joint …, 2015 - dl.acm.org
Microphones are remarkably powerful sensors of human behavior and context. However,
audio sensing is highly susceptible to wild fluctuations in accuracy when used in diverse …

Robust sound event classification using deep neural networks

I McLoughlin, H Zhang, Z Xie, Y Song… - IEEE/ACM Transactions …, 2015 - ieeexplore.ieee.org
The automatic recognition of sound events by computers is an important aspect of emerging
applications such as automated surveillance, machine hearing and auditory scene …

Audio retrieval with natural language queries

AM Oncescu, A Koepke, JF Henriques, Z Akata… - arXiv preprint arXiv …, 2021 - arxiv.org
We consider the task of retrieving audio using free-form natural language queries. To study
this problem, which has received limited attention in the existing literature, we introduce …

Improving cross-modal retrieval with set of diverse embeddings

D Kim, N Kim, S Kwak - … of the IEEE/CVF Conference on …, 2023 - openaccess.thecvf.com
Cross-modal retrieval across image and text modalities is a challenging task due to its
inherent ambiguity: An image often exhibits various situations, and a caption can be coupled …

On metric learning for audio-text cross-modal retrieval

X Mei, X Liu, J Sun, MD Plumbley, W Wang - arXiv preprint arXiv …, 2022 - arxiv.org
Audio-text retrieval aims at retrieving a target audio clip or caption from a pool of candidates
given a query in another modality. Solving such a cross-modal retrieval task is challenging …