Overview of speaker modeling and its applications: From the lens of deep speaker representation learning
Speaker individuality information is among the most critical elements within speech signals.
By thoroughly and accurately modeling this information, it can be utilized in various …
NeuroHeed: Neuro-steered speaker extraction using EEG signals
Humans possess the remarkable ability to selectively attend to a single speaker amidst
competing voices and background noise, known as selective auditory attention. Recent …
Enhancing code-switching speech recognition with interactive language biases
Languages usually switch within a multilingual speech signal, especially in a bilingual
society. This phenomenon is referred to as code-switching (CS), making automatic speech …
USED: Universal speaker extraction and diarization
Speaker extraction and diarization are two enabling techniques for real-world speech
applications. Speaker extraction aims to extract a target speaker's voice from a speech …
Speech foundation model ensembles for the controlled singing voice deepfake detection (CtrSVDD) challenge 2024
This work details our approach to achieving a leading system with a 1.79% pooled equal
error rate (EER) on the evaluation set of the Controlled Singing Voice Deepfake Detection …
Multi-stage Face-voice Association Learning with Keynote Speaker Diarization
The human brain has the capability to associate the unknown person's voice and face by
leveraging their general relationship, referred to as "cross-modal speaker verification". This …
Text-Queried Target Sound Event Localization
Sound event localization and detection (SELD) aims to determine the appearance of sound
classes, together with their Direction of Arrival (DOA). However, current SELD systems can …
DENSE: Dynamic Embedding Causal Target Speech Extraction
Y Wang, Z Yuan, X Wu - arXiv preprint arXiv:2409.06136, 2024 - arxiv.org
Target speech extraction (TSE) focuses on extracting the speech of a specific target speaker
from a mixture of signals. Existing TSE models typically utilize static embeddings as …
Target Speech Diarization with Multimodal Prompts
Traditional speaker diarization seeks to detect "who spoke when" according to speaker
characteristics. Extending to target speech diarization, we detect "when target event …
A Synopsis of FAME 2024 Challenge: Associating Faces with Voices in Multilingual Environments
Over half of the world's population is bilingual and people often communicate under
multilingual scenarios. The Face-Voice Association in Multilingual Environments (FAME) …