Cross-modal retrieval: a systematic review of methods and future directions

T Wang, F Li, L Zhu, J Li, Z Zhang… - Proceedings of the …, 2025 - ieeexplore.ieee.org
With the exponential surge in diverse multimodal data, traditional unimodal retrieval
methods struggle to meet the needs of users seeking access to data across various …

Automated audio captioning: An overview of recent progress and new challenges

X Mei, X Liu, MD Plumbley, W Wang - … journal on audio, speech, and music …, 2022 - Springer
Automated audio captioning is a cross-modal translation task that aims to generate natural
language descriptions for given audio clips. This task has received increasing attention with …

Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation

Y Wu, K Chen, T Zhang, Y Hui… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org
Contrastive learning has shown remarkable success in the field of multimodal
representation learning. In this paper, we propose a pipeline of contrastive language-audio …

WavCaps: A ChatGPT-assisted weakly-labelled audio captioning dataset for audio-language multimodal research

X Mei, C Meng, H Liu, Q Kong, T Ko… - … on Audio, Speech …, 2024 - ieeexplore.ieee.org
The advancement of audio-language (AL) multimodal learning tasks has been significant in
recent years, yet the limited size of existing audio-language datasets poses challenges for …

ONE-PEACE: Exploring one general representation model toward unlimited modalities

P Wang, S Wang, J Lin, S Bai, X Zhou, J Zhou… - arXiv preprint arXiv …, 2023 - arxiv.org
In this work, we explore a scalable way for building a general representation model toward
unlimited modalities. We release ONE-PEACE, a highly extensible model with 4B …

Separate anything you describe

X Liu, Q Kong, Y Zhao, H Liu, Y Yuan… - … on Audio, Speech …, 2024 - ieeexplore.ieee.org
Language-queried audio source separation (LASS) is a new paradigm for computational
auditory scene analysis (CASA). LASS aims to separate a target sound from an audio …

Improving text-audio retrieval by text-aware attention pooling and prior matrix revised loss

Y Xin, D Yang, Y Zou - ICASSP 2023-2023 IEEE International …, 2023 - ieeexplore.ieee.org
In text-audio retrieval (TAR) tasks, due to the heterogeneity of contents between text and
audio, the semantic information contained in the text is only similar to certain frames within …

Audio retrieval with WavText5K and CLAP training

S Deshmukh, B Elizalde, H Wang - arXiv preprint arXiv:2209.14275, 2022 - arxiv.org
Audio-Text retrieval takes a natural language query to retrieve relevant audio files in a
database. Conversely, Text-Audio retrieval takes an audio file as a query to retrieve relevant …

Modality-independent teachers meet weakly-supervised audio-visual event parser

YH Lai, YC Chen, F Wang - Advances in Neural Information …, 2023 - proceedings.neurips.cc
Audio-visual learning has been a major pillar of multi-modal machine learning, where the
community mostly focused on its modality-aligned setting, i.e., the audio …

FLAP: Fast language-audio pre-training

CF Yeh, PY Huang, V Sharma, SW Li… - 2023 IEEE Automatic …, 2023 - ieeexplore.ieee.org
We propose Fast Language-Audio Pre-training (FLAP), a self-supervised approach that
efficiently and effectively learns aligned audio and language representations through …