A survey on neural speech synthesis

X Tan, T Qin, F Soong, TY Liu - arXiv preprint arXiv:2106.15561, 2021 - arxiv.org
Text-to-speech (TTS), or speech synthesis, which aims to synthesize intelligible and natural
speech given text, is a hot research topic in speech, language, and machine learning …

Expressive TTS training with frame and style reconstruction loss

R Liu, B Sisman, G Gao, H Li - IEEE/ACM Transactions on …, 2021 - ieeexplore.ieee.org
We propose a novel training strategy for a Tacotron-based text-to-speech (TTS) system that
improves speech styling at the utterance level. One of the key challenges in prosody …

Teacher-student training for robust Tacotron-based TTS

R Liu, B Sisman, J Li, F Bao, G Gao… - ICASSP 2020-2020 …, 2020 - ieeexplore.ieee.org
While neural end-to-end text-to-speech (TTS) is superior to conventional statistical methods
in many ways, the exposure bias problem in autoregressive models remains an issue to …

The sequence-to-sequence baseline for the voice conversion challenge 2020: Cascading ASR and TTS

WC Huang, T Hayashi, S Watanabe, T Toda - arXiv preprint arXiv …, 2020 - arxiv.org
This paper presents the sequence-to-sequence (seq2seq) baseline system for the Voice
Conversion Challenge (VCC) 2020. We consider a naive approach for voice conversion (VC) …

Modeling prosodic phrasing with multi-task learning in Tacotron-based TTS

R Liu, B Sisman, F Bao, G Gao… - IEEE Signal Processing …, 2020 - ieeexplore.ieee.org
Tacotron-based end-to-end speech synthesis has shown remarkable voice quality.
However, the rendering of prosody in the synthesized speech remains to be improved …

Efficient neural speech synthesis for low-resource languages through multilingual modeling

M de Korte, J Kim, E Klabbers - arXiv preprint arXiv:2008.09659, 2020 - arxiv.org
Recent advances in neural TTS have led to models that can produce high-quality synthetic
speech. However, these models typically require large amounts of training data, which can …

Deepfake defense: Constructing and evaluating a specialized Urdu deepfake audio dataset

S Munir, W Sajjad, M Raza, E Abbas… - Findings of the …, 2024 - aclanthology.org
Deepfakes, particularly in the auditory domain, have become a significant threat,
necessitating the development of robust countermeasures. This paper addresses the …

ArmSpeech: Armenian spoken language corpus

VH Baghdasaryan - International Journal of Scientific Advances (IJSCIA), 2022 - ijscia.com
The Armenian language is an independent branch of the Indo-European language family
and the official language of the Republic of Armenia and the Republic of Artsakh. According …

Phoneme Duration Modeling Using Speech Rhythm-Based Speaker Embeddings for Multi-Speaker Speech Synthesis

K Fujita, A Ando, Y Ijima - Interspeech, 2021 - isca-archive.org
This paper proposes a novel speech-rhythm-based method for speaker embeddings.
Conventionally, spectral-feature-based speaker embedding vectors such as the x-vector are …

Deep Gaussian process based multi-speaker speech synthesis with latent speaker representation

K Mitsui, T Koriyama, H Saruwatari - Speech Communication, 2021 - Elsevier
This paper proposes deep Gaussian process (DGP)-based frameworks for multi-speaker
speech synthesis and speaker representation learning. A DGP has a deep architecture of …