A Comprehensive Review of Data‐Driven Co‐Speech Gesture Generation

S Nyatsanga, T Kucherenko, C Ahuja… - Computer Graphics …, 2023 - Wiley Online Library
Gestures that accompany speech are an essential part of natural and efficient embodied
human communication. The automatic generation of such co‐speech gestures is a long …

Deep encoder-decoder models for unsupervised learning of controllable speech synthesis

GE Henter, J Lorenzo-Trueba, X Wang… - arXiv preprint arXiv …, 2018 - arxiv.org
Generating versatile and appropriate synthetic speech requires control over the output
expression separate from the spoken text. Important non-textual speech variation is seldom …

From HMMs to DNNs: where do the improvements come from?

O Watts, GE Henter, T Merritt, Z Wu… - 2016 IEEE International …, 2016 - ieeexplore.ieee.org
Deep neural networks (DNNs) have recently been the focus of much text-to-speech research
as a replacement for decision trees and hidden Markov models (HMMs) in statistical …

Neural HMMs are all you need (for high-quality attention-free TTS)

S Mehta, É Székely, J Beskow… - ICASSP 2022-2022 …, 2022 - ieeexplore.ieee.org
Neural sequence-to-sequence TTS has achieved significantly better output quality than
statistical speech synthesis using HMMs. However, neural TTS is generally not probabilistic …

OverFlow: Putting flows on top of neural transducers for better TTS

S Mehta, A Kirkland, H Lameris, J Beskow… - arXiv preprint arXiv …, 2022 - arxiv.org
Neural HMMs are a type of neural transducer recently proposed for sequence-to-sequence
modelling in text-to-speech. They combine the best features of classic statistical speech …

Deep neural network-guided unit selection synthesis

T Merritt, RAJ Clark, Z Wu… - … on Acoustics, Speech …, 2016 - ieeexplore.ieee.org
Vocoding of speech is a standard part of statistical parametric speech synthesis systems. It
imposes an upper bound of the naturalness that can possibly be achieved. Hybrid systems …

Ctrl-P: Temporal control of prosodic variation for speech synthesis

DSR Mohan, V Hu, TH Teh, A Torresquintero… - arXiv preprint arXiv …, 2021 - arxiv.org
Text does not fully specify the spoken form, so text-to-speech models must be able to learn
from speech data that vary in ways not explained by the corresponding text. One way to …

An autoregressive recurrent mixture density network for parametric speech synthesis

X Wang, S Takaki, J Yamagishi - 2017 IEEE international …, 2017 - ieeexplore.ieee.org
Neural-network-based generative models, such as mixture density networks, are potential
solutions for speech synthesis. In this paper we follow this path and propose a recurrent …

Principles for learning controllable TTS from annotated and latent variation

GE Henter, J Lorenzo-Trueba, X Wang… - Interspeech …, 2017 - research.ed.ac.uk
For building flexible and appealing high-quality speech synthesisers, it is desirable to be
able to accommodate and reproduce fine variations in vocal expression present in natural …

Robust TTS duration modelling using DNNs

GE Henter, S Ronanki, O Watts… - … , Speech and Signal …, 2016 - ieeexplore.ieee.org
Accurate modelling and prediction of speech-sound durations is an important component in
generating more natural synthetic speech. Deep neural networks (DNNs) offer a powerful …