A survey on neural speech synthesis

X Tan, T Qin, F Soong, TY Liu - arXiv preprint arXiv:2106.15561, 2021 - arxiv.org
Text to speech (TTS), or speech synthesis, which aims to synthesize intelligible and natural
speech given text, is a hot research topic in speech, language, and machine learning …

LibriTTS: A corpus derived from LibriSpeech for text-to-speech

H Zen, V Dang, R Clark, Y Zhang, RJ Weiss… - arXiv preprint arXiv …, 2019 - arxiv.org
This paper introduces a new speech corpus called "LibriTTS" designed for text-to-speech
use. It is derived from the original audio and text materials of the LibriSpeech corpus, which …

FastPitch: Parallel text-to-speech with pitch prediction

A Łańcucki - ICASSP 2021-2021 IEEE International Conference …, 2021 - ieeexplore.ieee.org
We present FastPitch, a fully-parallel text-to-speech model based on FastSpeech,
conditioned on fundamental frequency contours. The model predicts pitch contours during …

Learning to speak fluently in a foreign language: Multilingual speech synthesis and cross-language voice cloning

Y Zhang, RJ Weiss, H Zen, Y Wu, Z Chen… - arXiv preprint arXiv …, 2019 - arxiv.org
We present a multispeaker, multilingual text-to-speech (TTS) synthesis model based on
Tacotron that is able to produce high quality speech in multiple languages. Moreover, the …

Location-relative attention mechanisms for robust long-form speech synthesis

E Battenberg, RJ Skerry-Ryan… - ICASSP 2020-2020 …, 2020 - ieeexplore.ieee.org
Despite the ability to produce human-level speech for in-domain text, attention-based end-to-
end text-to-speech (TTS) systems suffer from text alignment failures that increase in …

PnG BERT: Augmented BERT on phonemes and graphemes for neural TTS

Y Jia, H Zen, J Shen, Y Zhang, Y Wu - arXiv preprint arXiv:2103.15060, 2021 - arxiv.org
This paper introduces PnG BERT, a new encoder model for neural TTS. This model is
augmented from the original BERT model, by taking both phoneme and grapheme …

Mixed-Phoneme BERT: Improving BERT with mixed phoneme and sup-phoneme representations for text to speech

G Zhang, K Song, X Tan, D Tan, Y Yan, Y Liu… - arXiv preprint arXiv …, 2022 - arxiv.org
Recently, leveraging BERT pre-training to improve the phoneme encoder in text to speech
(TTS) has drawn increasing attention. However, the works apply pre-training with character …

Cotatron: Transcription-guided speech encoder for any-to-many voice conversion without parallel data

S Park, D Kim, M Joe - arXiv preprint arXiv:2005.03295, 2020 - arxiv.org
We propose Cotatron, a transcription-guided speech encoder for speaker-independent
linguistic representation. Cotatron is based on the multispeaker TTS architecture and can be …

Deep Griffin–Lim iteration: Trainable iterative phase reconstruction using neural network

Y Masuyama, K Yatabe, Y Koizumi… - IEEE Journal of …, 2020 - ieeexplore.ieee.org
In this paper, we propose a phase reconstruction framework, named Deep Griffin-Lim
Iteration (DeGLI). Phase reconstruction is a fundamental technique for improving the quality …

SoundChoice: Grapheme-to-phoneme models with semantic disambiguation

A Ploujnikov, M Ravanelli - arXiv preprint arXiv:2207.13703, 2022 - arxiv.org
End-to-end speech synthesis models directly convert the input characters into an audio
representation (e.g., spectrograms). Despite their impressive performance, such models have …