Audio-Language Datasets of Scenes and Events: A Survey
Audio-language models (ALMs) generate linguistic descriptions of sound-producing events
and scenes. Advances in dataset creation and computational power have led to significant …
Leveraging audio-only data for text-queried target sound extraction
The goal of text-queried target sound extraction (TSE) is to extract from a mixture a sound
source specified with a natural-language caption. While it is preferable to have access to …
Language-Queried Target Sound Extraction Without Parallel Training Data
Language-queried target sound extraction (TSE) aims to extract specific sounds from
mixtures based on language queries. Traditional fully-supervised training schemes require …
Multichannel-to-Multichannel Target Sound Extraction Using Direction and Timestamp Clues
We propose a multichannel-to-multichannel target sound extraction (M2M-TSE) framework
for separating multichannel target signals from a multichannel mixture of sound sources …
SoundBeam meets M2D: Target Sound Extraction with Audio Foundation Model
Target sound extraction (TSE) consists of isolating a desired sound from a mixture of
arbitrary sounds using clues to identify it. A TSE system requires solving two problems at …
SRPOL submission to DCASE 2024 Challenge Task 9: modeling real and imaginary components, mixit and SDR based loss
M Romaniuk, J Krzywdziak - 2024 - dcase.community
We present our solution to the DCASE 2024 challenge task 9 (Language-Queried Audio
Source Separation). Our solution is based on the official baseline, with training dataset …
Beyond speaker identity: Text guided target speech extraction
Target Speech Extraction (TSE) traditionally relies on explicit clues about the speaker's
identity like enrollment audio, face images, or videos, which may not always be available. In …