Automated audio captioning: An overview of recent progress and new challenges

X Mei, X Liu, MD Plumbley, W Wang - … journal on audio, speech, and music …, 2022 - Springer
Automated audio captioning is a cross-modal translation task that aims to generate natural
language descriptions for given audio clips. This task has received increasing attention with …

Separate anything you describe

X Liu, Q Kong, Y Zhao, H Liu, Y Yuan… - … on Audio, Speech …, 2024 - ieeexplore.ieee.org
Language-queried audio source separation (LASS) is a new paradigm for computational
auditory scene analysis (CASA). LASS aims to separate a target sound from an audio …

Improving text-audio retrieval by text-aware attention pooling and prior matrix revised loss

Y **, T Virtanen - arXiv preprint arXiv:2206.06108, 2022 - arxiv.org
Language-based audio retrieval is a task, where natural language textual captions are used
as queries to retrieve audio signals from a dataset. It has been first introduced into DCASE …

Improving audio-text retrieval via hierarchical cross-modal interaction and auxiliary captions

Y Xin, Y Zou - arXiv preprint arXiv:2307.15344, 2023 - arxiv.org
Most existing audio-text retrieval (ATR) methods focus on constructing contrastive pairs
between whole audio clips and complete caption sentences, while ignoring fine-grained …

Cooperative game modeling with weighted token-level alignment for audio-text retrieval

Y Xin, B Wang, L Shang - IEEE Signal Processing Letters, 2023 - ieeexplore.ieee.org
Previous audio-text retrieval (ATR) methods primarily concentrate on constructing
contrastive pairs between entire audio clips and full caption sentences, while neglecting fine …

CP-JKU's submission to Task 6b of the DCASE2023 Challenge: Audio retrieval with PaSST and GPT-augmented captions

P Primus, K Koutini, G Widmer - tech. rep., DCASE2023 …, 2023 - dcase.community
This technical report describes CP-JKU's submission to the natural-language-based audio
retrieval task of the 2023 DCASE Challenge (Task 6b). Our proposed system uses …