A review of deep learning techniques for speech processing

A Mehrish, N Majumder, R Bharadwaj, R Mihalcea… - Information …, 2023 - Elsevier
The field of speech processing has undergone a transformative shift with the advent of deep
learning. The use of multiple processing layers has enabled the creation of models capable …

Normalizing flows for probabilistic modeling and inference

G Papamakarios, E Nalisnick, DJ Rezende… - Journal of Machine …, 2021 - jmlr.org
Normalizing flows provide a general mechanism for defining expressive probability
distributions, only requiring the specification of a (usually simple) base distribution and a …
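To make the base-distribution-plus-transform recipe concrete, here is a minimal NumPy/SciPy sketch of a one-dimensional flow with a single affine bijection standing in for a learned transform; the constants and function names are illustrative, not from the paper.

```python
# Minimal sketch of a normalizing flow: a simple base density plus an
# invertible transform gives a new density whose samples and exact
# log-density are both cheap to compute. The affine map is a stand-in
# for a learned bijection.
import numpy as np
from scipy.stats import norm

# Base distribution: standard normal.
def base_log_prob(z):
    return norm.logpdf(z)

# Invertible transform x = f(z) = a * z + b and its inverse.
a, b = 2.0, -1.0
def f(z):
    return a * z + b
def f_inv(x):
    return (x - b) / a

# Change of variables: log p_X(x) = log p_Z(f^{-1}(x)) - log|a|
def flow_log_prob(x):
    return base_log_prob(f_inv(x)) - np.log(abs(a))

# Sampling: draw from the base and push through the transform.
z = np.random.randn(5)
x = f(z)
print(x, flow_log_prob(x))
```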

BigVGAN: A universal neural vocoder with large-scale training

S Lee, W Ping, B Ginsburg, B Catanzaro… - arxiv preprint arxiv …, 2022 - arxiv.org
Developing architectures suitable for modeling raw audio is a challenging problem due to
the high sampling rates of audio waveforms. Standard sequence modeling approaches like …
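Since BigVGAN is a GAN-based vocoder, a rough sketch of the adversarial setup may help: a generator upsamples mel-spectrogram frames into a raw waveform while a discriminator scores real versus generated audio. The toy PyTorch modules, layer sizes, and least-squares GAN loss below are assumptions for illustration, not the paper's architecture or objectives.

```python
# Toy GAN-vocoder training step: generator maps mel (B, 80, T) to a
# waveform (B, 1, T*256); discriminator scores waveform patches.
import torch
import torch.nn as nn

N_MELS, HOP = 80, 256                    # assumed feature size / upsampling factor

generator = nn.Sequential(               # 8 * 8 * 4 = 256x upsampling
    nn.ConvTranspose1d(N_MELS, 64, kernel_size=16, stride=8, padding=4),
    nn.LeakyReLU(0.1),
    nn.ConvTranspose1d(64, 32, kernel_size=16, stride=8, padding=4),
    nn.LeakyReLU(0.1),
    nn.ConvTranspose1d(32, 1, kernel_size=8, stride=4, padding=2),
    nn.Tanh(),
)
discriminator = nn.Sequential(           # waveform -> per-patch realness scores
    nn.Conv1d(1, 32, kernel_size=15, stride=4, padding=7), nn.LeakyReLU(0.1),
    nn.Conv1d(32, 64, kernel_size=15, stride=4, padding=7), nn.LeakyReLU(0.1),
    nn.Conv1d(64, 1, kernel_size=3, padding=1),
)
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
mse = nn.MSELoss()                       # least-squares GAN objective (assumed)

mel = torch.randn(4, N_MELS, 32)         # dummy conditioning features
real = torch.randn(4, 1, 32 * HOP)       # dummy ground-truth waveforms

# Discriminator step: push real toward 1, generated toward 0.
fake = generator(mel).detach()
d_real, d_fake = discriminator(real), discriminator(fake)
d_loss = mse(d_real, torch.ones_like(d_real)) + mse(d_fake, torch.zeros_like(d_fake))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: fool the discriminator (generated -> 1).
scores = discriminator(generator(mel))
g_loss = mse(scores, torch.ones_like(scores))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```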

A survey on neural speech synthesis

X Tan, T Qin, F Soong, TY Liu - arxiv preprint arxiv:2106.15561, 2021 - arxiv.org
Text to speech (TTS), or speech synthesis, which aims to synthesize intelligible and natural
speech given text, is a hot research topic in speech, language, and machine learning …
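As background for the survey's scope, here is a stubbed-out sketch of the common two-stage pipeline it covers: an acoustic model predicts a mel-spectrogram from text, and a vocoder turns the mel-spectrogram into a waveform. All function names, shapes, and the random placeholders are assumptions.

```python
# Stubbed two-stage TTS pipeline: text -> phonemes -> mel-spectrogram -> waveform.
import numpy as np

def text_to_phonemes(text: str) -> list[str]:
    # Stand-in front end: real systems use a grapheme-to-phoneme converter.
    return list(text.lower().replace(" ", "|"))

def acoustic_model(phonemes: list[str], n_mels: int = 80) -> np.ndarray:
    # Stand-in acoustic model: here it "predicts" ~10 random mel frames per phoneme.
    return np.random.randn(n_mels, 10 * len(phonemes))

def vocoder(mel: np.ndarray, hop: int = 256) -> np.ndarray:
    # Stand-in vocoder: maps each mel frame to a hop-size chunk of (random) waveform.
    return np.random.randn(mel.shape[1] * hop)

mel = acoustic_model(text_to_phonemes("hello world"))
wav = vocoder(mel)                       # 1-D waveform at the target sample rate
print(mel.shape, wav.shape)
```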

DiffWave: A versatile diffusion model for audio synthesis

Z Kong, W Ping, J Huang, K Zhao… - arxiv preprint arxiv …, 2020 - arxiv.org
In this work, we propose DiffWave, a versatile diffusion probabilistic model for conditional
and unconditional waveform generation. The model is non-autoregressive, and converts the …
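A minimal sketch of the denoising-diffusion training step that models like DiffWave build on: corrupt a clean waveform at a random diffusion step, then train a network to predict the injected noise. The tiny convolutional "denoiser", the linear noise schedule, and the omission of the diffusion-step embedding are simplifications, not DiffWave's actual design.

```python
# One training step of a toy denoising diffusion model on waveforms.
import torch
import torch.nn as nn

T = 50                                          # number of diffusion steps
betas = torch.linspace(1e-4, 0.05, T)           # assumed noise schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)   # cumulative product \bar{alpha}_t

denoiser = nn.Sequential(                       # (B, 1, L) -> (B, 1, L)
    nn.Conv1d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv1d(32, 1, kernel_size=3, padding=1),
)
opt = torch.optim.Adam(denoiser.parameters(), lr=1e-3)

x0 = torch.randn(8, 1, 1024)                    # dummy clean waveforms
t = torch.randint(0, T, (8,))                   # random diffusion step per example
noise = torch.randn_like(x0)

# Forward process: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * noise
abar = alpha_bar[t].view(-1, 1, 1)
x_t = abar.sqrt() * x0 + (1 - abar).sqrt() * noise

# Objective: predict the injected noise (step embedding omitted for brevity).
loss = nn.functional.mse_loss(denoiser(x_t), noise)
opt.zero_grad(); loss.backward(); opt.step()
```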

WaveGrad: Estimating gradients for waveform generation

N Chen, Y Zhang, H Zen, RJ Weiss, M Norouzi… - arxiv preprint arxiv …, 2020 - arxiv.org
This paper introduces WaveGrad, a conditional model for waveform generation which
estimates gradients of the data density. The model is built on prior work on score matching …
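A toy illustration of why an estimate of the data-density gradient (the score) is enough to generate samples: Langevin-style updates repeatedly follow the score and add noise. Here the analytic score of a standard normal stands in for WaveGrad's learned, mel-conditioned estimate, and the step size is an arbitrary assumption.

```python
# Langevin dynamics driven by the score (gradient of the log-density).
import numpy as np

def score(x):
    return -x                               # d/dx log N(x; 0, 1)

rng = np.random.default_rng(0)
x = 5.0 * rng.standard_normal(10_000)       # start far from the target density

step = 0.1
for _ in range(200):                        # gradient step plus injected noise
    x = x + step * score(x) + np.sqrt(2 * step) * rng.standard_normal(x.shape)

print(x.mean(), x.std())                    # should approach 0 and 1
```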

FastSpeech 2: Fast and high-quality end-to-end text to speech

Y Ren, C Hu, X Tan, T Qin, S Zhao, Z Zhao… - arxiv preprint arxiv …, 2020 - arxiv.org
Non-autoregressive text to speech (TTS) models such as FastSpeech can synthesize
speech significantly faster than previous autoregressive models with comparable quality …
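Parallel generation in the FastSpeech family relies on a length regulator: each phoneme's hidden vector is repeated for its predicted number of mel frames, so the whole mel-spectrogram can be decoded in one pass instead of frame by frame. The sizes and the rounding rule in this PyTorch sketch are assumptions.

```python
# Length-regulator sketch: expand per-phoneme states by predicted durations.
import torch

hidden = torch.randn(6, 256)                # one 256-dim vector per phoneme
log_durations = torch.randn(6)              # duration-predictor output (log frames)
frames = torch.clamp(torch.round(torch.exp(log_durations)), min=1).long()

# Expand: phoneme i is repeated frames[i] times along the time axis.
expanded = torch.repeat_interleave(hidden, frames, dim=0)
print(frames.tolist(), expanded.shape)      # (sum(frames), 256) -> decoder input
```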

Glow-TTS: A generative flow for text-to-speech via monotonic alignment search

J Kim, S Kim, J Kong, S Yoon - Advances in Neural …, 2020 - proceedings.neurips.cc
Recently, text-to-speech (TTS) models such as FastSpeech and ParaNet have been
proposed to generate mel-spectrograms from text in parallel. Despite the advantage, the …
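The monotonic alignment search in the title is a dynamic program over a log-likelihood matrix between text tokens and mel frames: it finds the best alignment in which frames are assigned to tokens in order, with no token skipped. The NumPy sketch below captures that idea under assumptions about the interface and tie-breaking; it is not the paper's reference implementation.

```python
# Dynamic-programming sketch of monotonic alignment search.
import numpy as np

def monotonic_alignment_search(log_lik: np.ndarray) -> np.ndarray:
    """log_lik[i, j]: log-likelihood of mel frame j under text token i.
    Returns, for each frame j, the token index it is aligned to under the
    best monotonic, non-skipping alignment."""
    n_text, n_mel = log_lik.shape
    Q = np.full((n_text, n_mel), -np.inf)
    Q[0, 0] = log_lik[0, 0]
    for j in range(1, n_mel):
        for i in range(min(j + 1, n_text)):
            stay = Q[i, j - 1]                              # keep the same token
            move = Q[i - 1, j - 1] if i > 0 else -np.inf    # advance to the next token
            Q[i, j] = max(stay, move) + log_lik[i, j]
    # Backtrack from the bottom-right corner.
    align = np.zeros(n_mel, dtype=int)
    i = n_text - 1
    for j in range(n_mel - 1, -1, -1):
        align[j] = i
        if j > 0 and i > 0 and Q[i - 1, j - 1] >= Q[i, j - 1]:
            i -= 1
    return align

ll = np.random.randn(4, 12)                 # 4 tokens, 12 mel frames
print(monotonic_alignment_search(ll))       # e.g. [0 0 1 1 1 2 2 2 3 3 3 3]
```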

Normalizing flows: An introduction and review of current methods

I Kobyzev, SJD Prince… - IEEE transactions on …, 2020 - ieeexplore.ieee.org
Normalizing Flows are generative models which produce tractable distributions where both
sampling and density evaluation can be efficient and exact. The goal of this survey article is …
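The exactness claim rests on the change-of-variables formula; stated compactly for an invertible transform f with base density p_Z (notation assumed):

```latex
% Change of variables for an invertible f: z ~ p_Z, x = f(z).
% Density evaluation is exact whenever the Jacobian determinant is tractable,
% which flow architectures arrange by construction.
\[
  p_X(x) \;=\; p_Z\!\left(f^{-1}(x)\right)\,
  \left|\det \frac{\partial f^{-1}(x)}{\partial x}\right|,
  \qquad
  \log p_X(x) \;=\; \log p_Z\!\left(f^{-1}(x)\right)
  \;+\; \log\left|\det \frac{\partial f^{-1}(x)}{\partial x}\right|.
\]
```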