A survey on neural speech synthesis
Text to speech (TTS), or speech synthesis, which aims to synthesize intelligible and natural
speech given text, is a hot research topic in speech, language, and machine learning …
LibriTTS: A corpus derived from LibriSpeech for text-to-speech
This paper introduces a new speech corpus called "LibriTTS" designed for text-to-speech
use. It is derived from the original audio and text materials of the LibriSpeech corpus, which …
FastPitch: Parallel text-to-speech with pitch prediction
A Łańcucki - ICASSP 2021 - 2021 IEEE International Conference …, 2021 - ieeexplore.ieee.org
We present FastPitch, a fully-parallel text-to-speech model based on FastSpeech,
conditioned on fundamental frequency contours. The model predicts pitch contours during …
Learning to speak fluently in a foreign language: Multilingual speech synthesis and cross-language voice cloning
We present a multispeaker, multilingual text-to-speech (TTS) synthesis model based on
Tacotron that is able to produce high quality speech in multiple languages. Moreover, the …
Location-relative attention mechanisms for robust long-form speech synthesis
Despite the ability to produce human-level speech for in-domain text, attention-based
end-to-end text-to-speech (TTS) systems suffer from text alignment failures that increase in …
PnG BERT: Augmented BERT on phonemes and graphemes for neural TTS
This paper introduces PnG BERT, a new encoder model for neural TTS. This model is
augmented from the original BERT model, by taking both phoneme and grapheme …
Mixed-phoneme BERT: Improving BERT with mixed phoneme and sup-phoneme representations for text to speech
Recently, leveraging BERT pre-training to improve the phoneme encoder in text to speech
(TTS) has drawn increasing attention. However, the works apply pre-training with character …
Cotatron: Transcription-guided speech encoder for any-to-many voice conversion without parallel data
We propose Cotatron, a transcription-guided speech encoder for speaker-independent
linguistic representation. Cotatron is based on the multispeaker TTS architecture and can be …
Deep Griffin–Lim iteration: Trainable iterative phase reconstruction using neural network
In this paper, we propose a phase reconstruction framework, named Deep Griffin-Lim
Iteration (DeGLI). Phase reconstruction is a fundamental technique for improving the quality …
SoundChoice: Grapheme-to-phoneme models with semantic disambiguation
A Ploujnikov, M Ravanelli - arXiv preprint arXiv:2207.13703, 2022 - arxiv.org
End-to-end speech synthesis models directly convert the input characters into an audio
representation (e.g., spectrograms). Despite their impressive performance, such models have …