A survey on neural speech synthesis

X Tan, T Qin, F Soong, TY Liu - arXiv preprint arXiv:2106.15561, 2021 - arxiv.org
Text to speech (TTS), or speech synthesis, which aims to synthesize intelligible and natural
speech given text, is a hot research topic in speech, language, and machine learning …

End-to-end adversarial text-to-speech

J Donahue, S Dieleman, M Bińkowski, E Elsen… - arXiv preprint arXiv …, 2020 - arxiv.org
Modern text-to-speech synthesis pipelines typically involve multiple processing stages, each
of which is designed or learnt independently from the rest. In this work, we take on the …

Parallel Tacotron: Non-autoregressive and controllable TTS

I Elias, H Zen, J Shen, Y Zhang, Y Jia… - ICASSP 2021-2021 …, 2021 - ieeexplore.ieee.org
Although neural end-to-end text-to-speech models can synthesize highly natural speech,
there is still room for improvement in their efficiency and naturalness. This paper proposes a …

RALL-E: Robust codec language modeling with chain-of-thought prompting for text-to-speech synthesis

D Xin, X Tan, K Shen, Z Ju, D Yang, Y Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
We present RALL-E, a robust language modeling method for text-to-speech (TTS) synthesis.
While previous work based on large language models (LLMs) shows impressive …

A vector quantized approach for text to speech synthesis on real-world spontaneous speech

LW Chen, S Watanabe, A Rudnicky - Proceedings of the AAAI …, 2023 - ojs.aaai.org
Recent Text-to-Speech (TTS) systems trained on reading or acted corpora have
achieved near human-level naturalness. The diversity of human speech, however, often …

MultiSpeech: Multi-speaker text to speech with transformer

M Chen, X Tan, Y Ren, J Xu, H Sun, S Zhao… - arXiv preprint arXiv …, 2020 - arxiv.org
Transformer-based text to speech (TTS) models (e.g., Transformer TTS [li2019neural],
FastSpeech [ren2019fastspeech]) have shown the advantages of training and inference …

Non-Attentive Tacotron: Robust and controllable neural TTS synthesis including unsupervised duration modeling

J Shen, Y Jia, M Chrzanowski, Y Zhang, I Elias… - arXiv preprint arXiv …, 2020 - arxiv.org
This paper presents Non-Attentive Tacotron based on the Tacotron 2 text-to-speech model,
replacing the attention mechanism with an explicit duration predictor. This improves …

VALL-E R: Robust and efficient zero-shot text-to-speech synthesis via monotonic alignment

B Han, L Zhou, S Liu, S Chen, L Meng, Y Qian… - arXiv preprint arXiv …, 2024 - arxiv.org
With the help of discrete neural audio codecs, large language models (LLMs) have
increasingly been recognized as a promising methodology for zero-shot Text-to-Speech …

Location-relative attention mechanisms for robust long-form speech synthesis

E Battenberg, RJ Skerry-Ryan… - ICASSP 2020-2020 …, 2020 - ieeexplore.ieee.org
Despite the ability to produce human-level speech for in-domain text, attention-based end-to-
end text-to-speech (TTS) systems suffer from text alignment failures that increase in …

Parallel Tacotron 2: A non-autoregressive neural TTS model with differentiable duration modeling

I Elias, H Zen, J Shen, Y Zhang, Y Jia… - arXiv preprint arXiv …, 2021 - arxiv.org
This paper introduces Parallel Tacotron 2, a non-autoregressive neural text-to-speech
model with a fully differentiable duration model which does not require supervised duration …