Neural source-filter waveform models for statistical parametric speech synthesis

X Wang, S Takaki, J Yamagishi - IEEE/ACM Transactions on …, 2019 - ieeexplore.ieee.org
Neural waveform models have demonstrated better performance than conventional
vocoders for statistical parametric speech synthesis. One of the best models, called …

Neural source-filter-based waveform model for statistical parametric speech synthesis

X Wang, S Takaki, J Yamagishi - ICASSP 2019 - 2019 IEEE …, 2019 - ieeexplore.ieee.org
Neural waveform models such as WaveNet are used in many recent text-to-speech
systems, but the original WaveNet is quite slow in waveform generation because of its …

Investigation of enhanced Tacotron text-to-speech synthesis systems with self-attention for pitch accent language

Y Yasuda, X Wang, S Takaki… - ICASSP 2019 - 2019 …, 2019 - ieeexplore.ieee.org
End-to-end speech synthesis is a promising approach that directly converts raw text to
speech. Although it was shown that Tacotron2 outperforms classical pipeline systems with …

I'm sorry for your loss: Spectrally-based audio distances are bad at pitch

J Turian, M Henry - arXiv preprint arXiv:2012.04572, 2020 - arxiv.org
Growing research demonstrates that synthetic failure modes imply poor generalization. We
compare commonly used audio-to-audio losses on a synthetic benchmark, measuring the …

Investigation of learning abilities on linguistic features in sequence-to-sequence text-to-speech synthesis

Y Yasuda, X Wang, J Yamagishi - Computer Speech & Language, 2021 - Elsevier
Neural sequence-to-sequence text-to-speech synthesis (TTS) can produce high-quality
speech directly from text or simple linguistic features such as phonemes. Unlike traditional …

Prosodic features control by symbols as input of sequence-to-sequence acoustic modeling for neural TTS

K Kurihara, N Seiyama, T Kumano - IEICE Transactions on …, 2021 - search.ieice.org
This paper describes a method to control prosodic features using phonetic and prosodic
symbols as input of attention-based sequence-to-sequence (seq2seq) acoustic modeling …

Neural harmonic-plus-noise waveform model with trainable maximum voice frequency for text-to-speech synthesis

X Wang, J Yamagishi - arXiv preprint arXiv:1908.10256, 2019 - arxiv.org
Neural source-filter (NSF) models are deep neural networks that produce waveforms given
input acoustic features. They use dilated-convolution-based neural filter modules to filter …

Training multi-speaker neural text-to-speech systems using speaker-imbalanced speech corpora

HT Luong, X Wang, J Yamagishi… - arXiv preprint arXiv …, 2019 - arxiv.org
When the available data of a target speaker is insufficient to train a high-quality speaker-dependent neural text-to-speech (TTS) system, we can combine data from multiple speakers …

Initial investigation of an encoder-decoder end-to-end TTS framework using marginalization of monotonic hard latent alignments

Y Yasuda, X Wang, J Yamagishi - arXiv preprint arXiv:1908.11535, 2019 - arxiv.org
End-to-end text-to-speech (TTS) synthesis is a method that directly converts input text to
output acoustic features using a single network. A recent advance of end-to-end TTS is due …

Modeling of Rakugo speech and its limitations: Toward speech synthesis that entertains audiences

S Kato, Y Yasuda, X Wang, E Cooper, S Takaki… - IEEE …, 2020 - ieeexplore.ieee.org
We have been investigating rakugo speech synthesis as a challenging example of speech
synthesis that entertains audiences. Rakugo is a traditional Japanese form of verbal …