AudioGPT: Understanding and generating speech, music, sound, and talking head

R Huang, M Li, D Yang, J Shi, X Chang, Z Ye… - Proceedings of the …, 2024 - ojs.aaai.org
Large language models (LLMs) have exhibited remarkable capabilities across a variety of
domains and tasks, challenging our understanding of learning and cognition. Despite the …

TF-GridNet: Integrating full- and sub-band modeling for speech separation

ZQ Wang, S Cornell, S Choi, Y Lee… - … on Audio, Speech …, 2023 - ieeexplore.ieee.org
We propose TF-GridNet for speech separation. The model is a novel deep neural network
(DNN) integrating full- and sub-band modeling in the time-frequency (TF) domain. It stacks …

SPMamba: State-space model is all you need in speech separation

K Li, G Chen, R Yang, X Hu - arXiv preprint arXiv:2404.02063, 2024 - arxiv.org
Existing CNN-based speech separation models face local receptive field limitations and
cannot effectively capture long-term dependencies. Although LSTM and Transformer-based …

CompNet: Complementary network for single-channel speech enhancement

C Fan, H Zhang, A Li, W Xiang, C Zheng, Z Lv, X Wu - Neural Networks, 2023 - Elsevier
Recent multi-domain processing methods have demonstrated promising performance for
monaural speech enhancement tasks. However, few of them explain why they behave better …

NeuroHeed: Neuro-steered speaker extraction using EEG signals

Z Pan, M Borsdorf, S Cai, T Schultz… - IEEE/ACM Transactions …, 2024 - ieeexplore.ieee.org
Humans possess the remarkable ability to selectively attend to a single speaker amidst
competing voices and background noise, known as selective auditory attention. Recent …

Diffusion-based generative speech source separation

R Scheibler, Y Ji, SW Chung, J Byun… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org
We propose DiffSep, a new single channel source separation method based on score-
matching of a stochastic differential equation (SDE). We craft a tailored continuous time …

State space model for new-generation network alternative to transformers: A survey

X Wang, S Wang, Y Ding, Y Li, W Wu, Y Rong… - arXiv preprint arXiv …, 2024 - arxiv.org
In the post-deep learning era, the Transformer architecture has demonstrated its powerful
performance across pre-trained big models and various downstream tasks. However, the …

Toward universal speech enhancement for diverse input conditions

W Zhang, K Saijo, ZQ Wang… - 2023 IEEE Automatic …, 2023 - ieeexplore.ieee.org
The past decade has witnessed substantial growth of data-driven speech enhancement (SE)
techniques thanks to deep learning. While existing approaches have shown impressive …

X-TF-GridNet: A time–frequency domain target speaker extraction network with adaptive speaker embedding fusion

F Hao, X Li, C Zheng - Information Fusion, 2024 - Elsevier
Target speaker extraction (TSE), which can directly extract the desired speech
given enrollment utterances of the target speaker, has attracted increasing attention for …

WeSep: A scalable and flexible toolkit towards generalizable target speaker extraction

S Wang, K Zhang, S Lin, J Li, X Wang, M Ge… - arXiv preprint arXiv …, 2024 - arxiv.org
Target speaker extraction (TSE) focuses on isolating the speech of a specific target speaker
from overlapped multi-talker speech, which is a typical setup in the cocktail party problem. In …