A Comprehensive Review of Data‐Driven Co‐Speech Gesture Generation
Gestures that accompany speech are an essential part of natural and efficient embodied
human communication. The automatic generation of such co‐speech gestures is a long …
human communication. The automatic generation of such co‐speech gestures is a long …
Deep encoder-decoder models for unsupervised learning of controllable speech synthesis
Generating versatile and appropriate synthetic speech requires control over the output
expression separate from the spoken text. Important non-textual speech variation is seldom …
expression separate from the spoken text. Important non-textual speech variation is seldom …
From HMMs to DNNs: where do the improvements come from?
Deep neural networks (DNNs) have recently been the focus of much text-to-speech research
as a replacement for decision trees and hidden Markov models (HMMs) in statistical …
as a replacement for decision trees and hidden Markov models (HMMs) in statistical …
Neural HMMs are all you need (for high-quality attention-free TTS)
Neural sequence-to-sequence TTS has achieved significantly better output quality than
statistical speech synthesis using HMMs. However, neural TTS is generally not probabilistic …
statistical speech synthesis using HMMs. However, neural TTS is generally not probabilistic …
OverFlow: Putting flows on top of neural transducers for better TTS
Neural HMMs are a type of neural transducer recently proposed for sequence-to-sequence
modelling in text-to-speech. They combine the best features of classic statistical speech …
modelling in text-to-speech. They combine the best features of classic statistical speech …
Deep neural network-guided unit selection synthesis
Vocoding of speech is a standard part of statistical parametric speech synthesis systems. It
imposes an upper bound of the naturalness that can possibly be achieved. Hybrid systems …
imposes an upper bound of the naturalness that can possibly be achieved. Hybrid systems …
Ctrl-P: Temporal control of prosodic variation for speech synthesis
Text does not fully specify the spoken form, so text-to-speech models must be able to learn
from speech data that vary in ways not explained by the corresponding text. One way to …
from speech data that vary in ways not explained by the corresponding text. One way to …
An autoregressive recurrent mixture density network for parametric speech synthesis
Neural-network-based generative models, such as mixture density networks, are potential
solutions for speech synthesis. In this paper we follow this path and propose a recurrent …
solutions for speech synthesis. In this paper we follow this path and propose a recurrent …
Principles for learning controllable TTS from annotated and latent variation
For building flexible and appealing high-quality speech synthesisers, it is desirable to be
able to accommodate and reproduce fine variations in vocal expression present in natural …
able to accommodate and reproduce fine variations in vocal expression present in natural …
Robust TTS duration modelling using DNNs
Accurate modelling and prediction of speech-sound durations is an important component in
generating more natural synthetic speech. Deep neural networks (DNNs) offer a powerful …
generating more natural synthetic speech. Deep neural networks (DNNs) offer a powerful …