Audio-Language Datasets of Scenes and Events: A Survey

G Wijngaard, E Formisano, M Esposito… - IEEE …, 2025 - ieeexplore.ieee.org
Audio-language models (ALMs) generate linguistic descriptions of sound-producing events
and scenes. Advances in dataset creation and computational power have led to significant …

Leveraging audio-only data for text-queried target sound extraction

K Saijo, J Ebbers, FG Germain, S Khurana… - arXiv preprint arXiv …, 2024 - arxiv.org
The goal of text-queried target sound extraction (TSE) is to extract from a mixture a sound
source specified with a natural-language caption. While it is preferable to have access to …

Language-Queried Target Sound Extraction Without Parallel Training Data

H Ma, Z Peng, X Li, Y Li, M Shao, Q Kong… - arXiv preprint arXiv …, 2024 - arxiv.org
Language-queried target sound extraction (TSE) aims to extract specific sounds from
mixtures based on language queries. Traditional fully-supervised training schemes require …

Multichannel-to-Multichannel Target Sound Extraction Using Direction and Timestamp Clues

D Choi, JW Choi - arXiv preprint arXiv:2409.12415, 2024 - arxiv.org
We propose a multichannel-to-multichannel target sound extraction (M2M-TSE) framework
for separating multichannel target signals from a multichannel mixture of sound sources …

SoundBeam meets M2D: Target Sound Extraction with Audio Foundation Model

C Hernandez-Olivan, M Delcroix, T Ochiai… - arXiv preprint arXiv …, 2024 - arxiv.org
Target sound extraction (TSE) consists of isolating a desired sound from a mixture of
arbitrary sounds using clues to identify it. A TSE system requires solving two problems at …

SRPOL submission to DCASE 2024 Challenge Task 9: modeling real and imaginary components, mixit and SDR based loss

M Romaniuk, J Krzywdziak - 2024 - dcase.community
We present our solution to the DCASE 2024 challenge task 9 (Language-Queried Audio
Source Separation). Our solution is based on the official baseline, with training dataset …

Beyond speaker identity: Text guided target speech extraction

M Huo, A Jain, CP Huynh, F Kong, P Wang… - arXiv preprint arXiv …, 2025 - arxiv.org
Target Speech Extraction (TSE) traditionally relies on explicit clues about the speaker's
identity like enrollment audio, face images, or videos, which may not always be available. In …