AudioGPT: Understanding and generating speech, music, sound, and talking head
Large language models (LLMs) have exhibited remarkable capabilities across a variety of
domains and tasks, challenging our understanding of learning and cognition. Despite the …
TF-GridNet: Integrating full- and sub-band modeling for speech separation
We propose TF-GridNet for speech separation. The model is a novel deep neural network
(DNN) integrating full- and sub-band modeling in the time-frequency (TF) domain. It stacks …
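The core idea, a block that alternates sub-band modeling (a sequence model along time for every frequency bin) with full-band modeling (a sequence model along frequency for every frame), can be sketched as below. This is a hedged illustration, not the authors' implementation; the class name, hidden size, and residual wiring are illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of alternating sub-band / full-band
# modeling on a time-frequency (TF) representation, in the spirit of TF-GridNet.
import torch
import torch.nn as nn


class SubFullBandBlock(nn.Module):
    """One block: a BLSTM along time for every frequency bin (sub-band path),
    then a BLSTM along frequency for every frame (full-band path)."""

    def __init__(self, channels: int, hidden: int = 64):
        super().__init__()
        self.sub_rnn = nn.LSTM(channels, hidden, batch_first=True, bidirectional=True)
        self.sub_proj = nn.Linear(2 * hidden, channels)
        self.full_rnn = nn.LSTM(channels, hidden, batch_first=True, bidirectional=True)
        self.full_proj = nn.Linear(2 * hidden, channels)

    def forward(self, x):  # x: [B, T, F, C] TF features (e.g., stacked real/imag)
        b, t, f, c = x.shape
        # Sub-band: treat each frequency bin as a sequence over time.
        s = x.permute(0, 2, 1, 3).reshape(b * f, t, c)
        s = self.sub_proj(self.sub_rnn(s)[0]).reshape(b, f, t, c).permute(0, 2, 1, 3)
        x = x + s                                    # residual connection
        # Full-band: treat each frame as a sequence over frequency.
        g = x.reshape(b * t, f, c)
        g = self.full_proj(self.full_rnn(g)[0]).reshape(b, t, f, c)
        return x + g                                 # residual connection


if __name__ == "__main__":
    spec = torch.randn(2, 100, 129, 16)              # [batch, frames, freq bins, channels]
    print(SubFullBandBlock(16)(spec).shape)          # torch.Size([2, 100, 129, 16])
```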
SPMamba: State-space model is all you need in speech separation
Existing CNN-based speech separation models face local receptive field limitations and
cannot effectively capture long-term dependencies. Although LSTM and Transformer-based …
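For context, the recurrence that state-space layers build on is h_t = A h_{t-1} + B u_t, y_t = C h_t + D u_t: a fixed-size state summarizes an arbitrarily long past, which is what the abstract contrasts with local CNN receptive fields. The toy layer below is a plain (non-selective) linear SSM sketch, not SPMamba's selective Mamba blocks; all names, sizes, and initializations are illustrative assumptions.

```python
# Toy linear state-space layer illustrating the recurrence
#   h_t = A h_{t-1} + B u_t,   y_t = C h_t + D u_t
# (sequential scan shown for clarity; real SSM layers use faster algorithms).
import torch
import torch.nn as nn


class ToySSM(nn.Module):
    def __init__(self, dim: int, state: int = 16):
        super().__init__()
        self.A = nn.Parameter(torch.eye(state) * 0.9)     # state transition
        self.B = nn.Parameter(torch.randn(state, dim) * 0.01)
        self.C = nn.Parameter(torch.randn(dim, state) * 0.01)
        self.D = nn.Parameter(torch.ones(dim))             # skip / feed-through

    def forward(self, u):                                  # u: [B, T, dim]
        b, t, d = u.shape
        h = u.new_zeros(b, self.A.shape[0])
        ys = []
        for k in range(t):                                 # recurrence over time
            h = h @ self.A.T + u[:, k] @ self.B.T
            ys.append(h @ self.C.T + self.D * u[:, k])
        return torch.stack(ys, dim=1)                      # [B, T, dim]


if __name__ == "__main__":
    y = ToySSM(dim=8)(torch.randn(2, 50, 8))
    print(y.shape)                                         # torch.Size([2, 50, 8])
```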
CompNet: Complementary network for single-channel speech enhancement
Recent multi-domain processing methods have demonstrated promising performance for
monaural speech enhancement tasks. However, few of them explain why they behave better …
NeuroHeed: Neuro-steered speaker extraction using EEG signals
Humans possess the remarkable ability to selectively attend to a single speaker amidst
competing voices and background noise, known as selective auditory attention. Recent …
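The snippet describes steering extraction with a listener-side cue. A minimal sketch of that conditioning pattern, an EEG encoder producing a reference embedding that gates the mixture representation, is shown below; it is an assumed illustration of cue conditioning in general, not NeuroHeed's architecture, and every module name and dimension is hypothetical.

```python
# Hedged sketch of neuro-steered conditioning: an EEG-derived reference
# embedding gates the encoded mixture before masking. Illustrative only.
import torch
import torch.nn as nn


class EEGSteeredExtractor(nn.Module):
    def __init__(self, mix_dim: int = 128, eeg_channels: int = 64, emb_dim: int = 128):
        super().__init__()
        self.eeg_encoder = nn.GRU(eeg_channels, emb_dim, batch_first=True)
        self.fuse = nn.Linear(mix_dim + emb_dim, mix_dim)
        self.mask = nn.Sequential(nn.Linear(mix_dim, mix_dim), nn.Sigmoid())

    def forward(self, mix_feats, eeg):
        # mix_feats: [B, T, mix_dim] encoded mixture; eeg: [B, T_eeg, eeg_channels]
        _, h = self.eeg_encoder(eeg)                     # h: [1, B, emb_dim]
        cue = h[-1].unsqueeze(1).expand(-1, mix_feats.size(1), -1)
        fused = torch.relu(self.fuse(torch.cat([mix_feats, cue], dim=-1)))
        return mix_feats * self.mask(fused)              # masked (extracted) features


if __name__ == "__main__":
    out = EEGSteeredExtractor()(torch.randn(2, 100, 128), torch.randn(2, 256, 64))
    print(out.shape)                                     # torch.Size([2, 100, 128])
```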
Diffusion-based generative speech source separation
We propose DiffSep, a new single channel source separation method based on score-matching
of a stochastic differential equation (SDE). We craft a tailored continuous time …
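A generic denoising score-matching objective illustrates the training signal such SDE-based models use. DiffSep's tailored SDE, whose mean drifts from the separated sources toward the mixture, is more involved, so treat the noise schedule, weighting, and names below as simplified assumptions rather than the paper's method.

```python
# Generic (assumed, simplified) denoising score-matching loss: a network
# s_theta(x_t, t) is trained to predict the score of a Gaussian perturbation
# kernel around the clean sources.
import torch


def score_matching_loss(score_net, x0: torch.Tensor,
                        sigma_min: float = 0.01, sigma_max: float = 1.0):
    """x0: clean sources, shape [B, ...]; returns a scalar DSM loss."""
    b = x0.shape[0]
    t = torch.rand(b, device=x0.device)                   # diffusion time t ~ U(0, 1)
    sigma = sigma_min * (sigma_max / sigma_min) ** t      # geometric noise schedule (assumed)
    sigma = sigma.view(b, *([1] * (x0.dim() - 1)))        # broadcast over signal dims
    noise = torch.randn_like(x0)
    x_t = x0 + sigma * noise                              # perturbed sample
    target = -noise / sigma                               # score of the Gaussian kernel
    pred = score_net(x_t, t)                              # s_theta(x_t, t)
    return ((sigma * (pred - target)) ** 2).mean()        # sigma-weighted DSM loss


if __name__ == "__main__":
    stand_in = lambda x, t: torch.zeros_like(x)           # placeholder score network
    x0 = torch.randn(4, 2, 16000)                         # [batch, sources, samples]
    print(score_matching_loss(stand_in, x0).item())
```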
State space model for new-generation network alternative to transformers: A survey
In the post-deep learning era, the Transformer architecture has demonstrated its powerful
performance across pre-trained big models and various downstream tasks. However, the …
Toward universal speech enhancement for diverse input conditions
The past decade has witnessed substantial growth of data-driven speech enhancement (SE)
techniques thanks to deep learning. While existing approaches have shown impressive …
X-TF-GridNet: A time-frequency domain target speaker extraction network with adaptive speaker embedding fusion
Target speaker extraction (TSE), which can directly extract the desired speech given enrollment
utterances of the target speaker, has attracted more and more attention for …
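The fusion idea, conditioning separator features on an enrollment speaker embedding, can be sketched with a FiLM-style scale and shift as below. This is an assumed, simplified stand-in, not the paper's adaptive fusion mechanism, and the module name and dimensions are hypothetical.

```python
# Hedged sketch of speaker-embedding fusion for target speaker extraction:
# an enrollment embedding modulates the separator features. Illustrative only.
import torch
import torch.nn as nn


class SpeakerFusion(nn.Module):
    def __init__(self, feat_dim: int = 128, spk_dim: int = 256):
        super().__init__()
        self.to_scale = nn.Linear(spk_dim, feat_dim)
        self.to_shift = nn.Linear(spk_dim, feat_dim)

    def forward(self, feats, spk_emb):
        # feats: [B, T, feat_dim] mixture features; spk_emb: [B, spk_dim] enrollment embedding
        scale = self.to_scale(spk_emb).unsqueeze(1)       # [B, 1, feat_dim]
        shift = self.to_shift(spk_emb).unsqueeze(1)
        return feats * (1 + scale) + shift                # conditioned features


if __name__ == "__main__":
    fused = SpeakerFusion()(torch.randn(2, 100, 128), torch.randn(2, 256))
    print(fused.shape)                                    # torch.Size([2, 100, 128])
```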
WeSep: A scalable and flexible toolkit towards generalizable target speaker extraction
Target speaker extraction (TSE) focuses on isolating the speech of a specific target speaker
from overlapped multi-talker speech, which is a typical setup in the cocktail party problem. In …