An overview of deep-learning-based audio-visual speech enhancement and separation

D Michelsanti, ZH Tan, SX Zhang, Y Xu… - … on Audio, Speech …, 2021 - ieeexplore.ieee.org
Speech enhancement and speech separation are two related tasks, whose purpose is to
extract a single target speech signal or multiple target speech signals, respectively, from a mixture of sounds …
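
Below is a minimal, hedged sketch (not taken from the paper) of the signal model the two tasks share: enhancement aims to recover one target from a noisy mixture, while separation aims to recover several concurrent speakers from their sum. The synthetic sinusoids and noise are placeholders for real speech and interference.

```python
# Illustrative signal model only; sinusoids stand in for real speech signals.
import numpy as np

rng = np.random.default_rng(0)
fs = 16000                                   # 16 kHz sampling rate
t = np.arange(fs) / fs                       # 1 second of samples

speaker_a = np.sin(2 * np.pi * 220 * t)      # stand-in for target speech
speaker_b = np.sin(2 * np.pi * 330 * t)      # stand-in for a second speaker
noise = 0.1 * rng.standard_normal(t.shape)   # stand-in for background noise

mixture_enh = speaker_a + noise              # enhancement: estimate speaker_a
mixture_sep = speaker_a + speaker_b          # separation: estimate both speakers
```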

Supervised speech separation based on deep learning: An overview

DL Wang, J Chen - IEEE/ACM transactions on audio, speech …, 2018 - ieeexplore.ieee.org
Speech separation is the task of separating target speech from background interference.
Traditionally, speech separation is studied as a signal processing problem. A more recent …
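
A common construct in this supervised-learning formulation is a time-frequency training target such as the ideal ratio mask (IRM). The sketch below, with assumed STFT parameters and the usual power-spectrum form of the mask, only illustrates the idea, not the paper's exact recipe.

```python
# Hedged sketch: compute an ideal-ratio-mask training target from clean speech
# and noise; a DNN would then be trained to predict it from noisy features.
import numpy as np
from scipy.signal import stft

def ideal_ratio_mask(speech, noise, n_fft=512, hop=128, eps=1e-8):
    _, _, S = stft(speech, nperseg=n_fft, noverlap=n_fft - hop)
    _, _, N = stft(noise, nperseg=n_fft, noverlap=n_fft - hop)
    ps, pn = np.abs(S) ** 2, np.abs(N) ** 2
    return np.sqrt(ps / (ps + pn + eps))     # one value in [0, 1] per T-F unit
```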

WavLM: Large-scale self-supervised pre-training for full stack speech processing

S Chen, C Wang, Z Chen, Y Wu, S Liu… - IEEE Journal of …, 2022 - ieeexplore.ieee.org
Self-supervised learning (SSL) has achieved great success in speech recognition, while its
exploration for other speech processing tasks remains limited. As speech signal …
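
As a usage illustration only, the sketch below loads a pre-trained WavLM checkpoint through the Hugging Face transformers API and extracts frame-level features for a downstream task; the checkpoint id "microsoft/wavlm-base-plus" and the frozen-feature setup are assumptions, not details from the paper.

```python
# Hedged sketch: WavLM as a frozen feature extractor (checkpoint id assumed).
import torch
from transformers import AutoFeatureExtractor, WavLMModel

extractor = AutoFeatureExtractor.from_pretrained("microsoft/wavlm-base-plus")
model = WavLMModel.from_pretrained("microsoft/wavlm-base-plus").eval()

waveform = torch.zeros(16000)                # placeholder: 1 s of 16 kHz audio
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    features = model(**inputs).last_hidden_state   # (1, frames, hidden_dim)
```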

DCCRN: Deep complex convolution recurrent network for phase-aware speech enhancement

Y Hu, Y Liu, S Lv, M Xing, S Zhang, Y Fu, J Wu… - arXiv preprint arXiv …, 2020 - arxiv.org
Speech enhancement has benefited from the success of deep learning in terms of
intelligibility and perceptual quality. Conventional time-frequency (TF) domain methods …
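
The phase-aware idea behind complex masking can be illustrated as below: rather than scaling magnitudes with a real-valued mask, a complex-valued mask rescales and rotates every T-F bin, so phase is modified together with magnitude. The constant mask is a placeholder for a network's output; this is a sketch of the general technique, not DCCRN itself.

```python
# Hedged sketch: applying a complex mask to the noisy STFT (mask is a placeholder).
import torch

n_fft, hop = 512, 128
win = torch.hann_window(n_fft)
noisy = torch.randn(16000)                               # placeholder waveform
spec = torch.stft(noisy, n_fft, hop_length=hop, window=win, return_complex=True)

complex_mask = torch.full_like(spec, 0.8 + 0.1j)         # stand-in for the DNN output
enhanced_spec = complex_mask * spec                      # scales magnitude, shifts phase
enhanced = torch.istft(enhanced_spec, n_fft, hop_length=hop, window=win)
```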

Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation

Y Luo, N Mesgarani - IEEE/ACM transactions on audio, speech …, 2019 - ieeexplore.ieee.org
Single-channel, speaker-independent speech separation methods have recently seen great
progress. However, the accuracy, latency, and computational cost of such methods remain …
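
The time-domain layout this line of work popularized (a learned 1-D convolutional encoder that replaces the STFT, masks applied to its output, and a transposed convolution that reconstructs each source) can be sketched as follows. Layer sizes are illustrative, and the single 1x1 convolution stands in for the paper's deep temporal convolutional separator.

```python
# Hedged sketch of an encoder / mask / decoder time-domain separator.
import torch
import torch.nn as nn

n_filters, kernel, stride, n_src = 256, 16, 8, 2
encoder = nn.Conv1d(1, n_filters, kernel, stride=stride, bias=False)
separator = nn.Sequential(                       # stand-in for the TCN separator
    nn.Conv1d(n_filters, n_filters * n_src, 1), nn.Sigmoid())
decoder = nn.ConvTranspose1d(n_filters, 1, kernel, stride=stride, bias=False)

mix = torch.randn(1, 1, 16000)                   # (batch, channel, samples)
feats = torch.relu(encoder(mix))                 # learned "spectrogram"
masks = separator(feats).chunk(n_src, dim=1)     # one mask per source
sources = [decoder(m * feats) for m in masks]    # masked features -> waveforms
```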

Deep learning for audio signal processing

H Purwins, B Li, T Virtanen, J Schlüter… - IEEE Journal of …, 2019 - ieeexplore.ieee.org
Given the recent surge of developments in deep learning, this paper provides a review of the
state-of-the-art deep learning techniques for audio signal processing. Speech, music, and …

Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation

A Ephrat, I Mosseri, O Lang, T Dekel, K Wilson… - arXiv preprint arXiv …, 2018 - arxiv.org
We present a joint audio-visual model for isolating a single speech signal from a mixture of
sounds such as other speakers and background noise. Solving this task using only audio as …
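
One way to picture the audio-visual fusion is sketched below: per-frame visual embeddings of the target speaker's face are concatenated with audio features before mask prediction, so the network is conditioned on which speaker to keep. All dimensions and the single linear fusion layer are assumptions for illustration, not the paper's architecture.

```python
# Hedged sketch of audio-visual feature fusion for mask prediction.
import torch
import torch.nn as nn

frames, audio_dim, visual_dim, freq_bins = 100, 256, 512, 257
audio_feats = torch.randn(1, frames, audio_dim)    # from an audio encoder
visual_feats = torch.randn(1, frames, visual_dim)  # from a face-embedding network

fusion = nn.Linear(audio_dim + visual_dim, freq_bins)
mask = torch.sigmoid(fusion(torch.cat([audio_feats, visual_feats], dim=-1)))
# mask: (1, frames, freq_bins), applied to the target's mixture spectrogram
```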

TCNN: Temporal convolutional neural network for real-time speech enhancement in the time domain

A Pandey, DL Wang - ICASSP 2019-2019 IEEE International …, 2019 - ieeexplore.ieee.org
This work proposes a fully convolutional neural network (CNN) for real-time speech
enhancement in the time domain. The proposed CNN is an encoder-decoder based …
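
A minimal sketch of a causal, time-domain encoder-decoder CNN is given below: strided convolutions compress the waveform, a transposed convolution restores it, and left-only padding keeps every output sample dependent on past input only, which is what makes real-time operation possible. Depth and channel counts are illustrative, not the paper's.

```python
# Hedged sketch: causal 1-D conv encoder-decoder on the raw waveform.
import torch
import torch.nn as nn

class CausalConvED(nn.Module):
    def __init__(self, ch=32, k=8, s=4):
        super().__init__()
        self.pad = nn.ConstantPad1d((k - s, 0), 0.0)   # pad past samples only
        self.enc = nn.Conv1d(1, ch, k, stride=s)
        self.dec = nn.ConvTranspose1d(ch, 1, k, stride=s)

    def forward(self, x):                              # x: (batch, 1, samples)
        h = torch.relu(self.enc(self.pad(x)))
        return self.dec(h)[..., : x.shape[-1]]         # trim to input length

enhanced = CausalConvED()(torch.randn(1, 1, 16000))
```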

PHASEN: A phase-and-harmonics-aware speech enhancement network

D Yin, C Luo, Z Xiong, W Zeng - Proceedings of the AAAI Conference on …, 2020 - ojs.aaai.org
Time-frequency (TF) domain masking is a mainstream approach for single-channel speech
enhancement. Recently, attention has also turned to phase prediction in addition to amplitude …
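
The amplitude-plus-phase idea can be sketched as follows: one stream predicts a magnitude mask, another predicts a unit-modulus phase, and the two are combined before the inverse STFT. The random tensors below merely stand in for the two network streams; this illustrates the general two-stream idea rather than the paper's exact network.

```python
# Hedged sketch: combine a predicted amplitude mask with a predicted phase.
import torch

n_fft, hop = 512, 128
win = torch.hann_window(n_fft)
noisy_spec = torch.stft(torch.randn(16000), n_fft, hop_length=hop,
                        window=win, return_complex=True)

amp_mask = torch.rand_like(noisy_spec.real)            # amplitude-stream output
phase = torch.randn_like(noisy_spec)                   # phase-stream output
phase = phase / (phase.abs() + 1e-8)                   # normalize to unit modulus

enhanced_spec = amp_mask * noisy_spec.abs() * phase    # magnitude x refined phase
enhanced = torch.istft(enhanced_spec, n_fft, hop_length=hop, window=win)
```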

A convolutional recurrent neural network for real-time speech enhancement.

K Tan, DL Wang - Interspeech, 2018 - researchgate.net
Many real-world applications of speech enhancement, such as hearing aids and cochlear
implants, require real-time processing with no or low latency. In this paper, we propose a …
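
Why recurrent models suit such low-latency use can be sketched as below: an RNN is stepped one STFT frame at a time with its state carried forward, so each output frame depends only on past input. The single LSTM layer and the sizes are placeholders for the paper's convolutional-recurrent architecture.

```python
# Hedged sketch: frame-by-frame streaming with a stateful LSTM mask estimator.
import torch
import torch.nn as nn

freq_bins = 257
rnn = nn.LSTM(freq_bins, 256, batch_first=True)
proj = nn.Linear(256, freq_bins)

state = None
for _ in range(100):                                   # stream 100 frames
    frame = torch.rand(1, 1, freq_bins)                # one noisy magnitude frame
    out, state = rnn(frame, state)                     # carry state: causal
    mask = torch.sigmoid(proj(out))                    # mask for this frame only
```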