Speech technology progress based on new machine learning paradigm

V Delić, Z Perić, M Sečujski… - Computational …, 2019 - Wiley Online Library
Speech technologies have been developed for decades as a typical signal processing area,
while the last decade has brought a huge progress based on new machine learning …

Controllable emotion transfer for end-to-end speech synthesis

T Li, S Yang, L Xue, L Xie - 2021 12th International Symposium …, 2021 - ieeexplore.ieee.org
Emotion embedding space learned from references is a straight-forward approach for
emotion transfer in encoder-decoder structured emotional text to speech (TTS) systems …

Cross-speaker emotion disentangling and transfer for end-to-end speech synthesis

T Li, X Wang, Q Xie, Z Wang… - IEEE/ACM Transactions on …, 2022 - ieeexplore.ieee.org
The cross-speaker emotion transfer task in text-to-speech (TTS) synthesis particularly aims
to synthesize speech for a target speaker with the emotion transferred from reference …

iEmoTTS: Toward robust cross-speaker emotion transfer and control for speech synthesis based on disentanglement between prosody and timbre

G Zhang, Y Qin, W Zhang, J Wu, M Li… - … on Audio, Speech …, 2023 - ieeexplore.ieee.org
Cross-speaker emotion transfer is a common approach to generating emotional speech
when speech data with emotion labels from target speakers is not available. This paper …

Controlling emotion strength with relative attribute for end-to-end speech synthesis

X Zhu, S Yang, G Yang, L Xie - 2019 IEEE Automatic Speech …, 2019 - ieeexplore.ieee.org
Recently, attention-based end-to-end speech synthesis has achieved superior performance
compared to traditional speech synthesis models, and several approaches like global style …

Multi-speaker emotional acoustic modeling for CNN-based speech synthesis

H Choi, S Park, J Park, M Hahn - ICASSP 2019-2019 IEEE …, 2019 - ieeexplore.ieee.org
In this paper, we investigate multi-speaker emotional acoustic modeling methods for
convolutional neural network (CNN) based speech synthesis system. For emotion modeling …

Hierarchical multi-grained generative model for expressive speech synthesis

Y Hono, K Tsuboi, K Sawada, K Hashimoto… - arXiv preprint arXiv …, 2020 - arxiv.org
This paper proposes a hierarchical generative model with a multi-grained latent variable to
synthesize expressive speech. In recent years, fine-grained latent variables are introduced …

Model architectures to extrapolate emotional expressions in DNN-based text-to-speech

K Inoue, S Hara, M Abe, N Hojo, Y Ijima - Speech Communication, 2021 - Elsevier
This paper proposes architectures that facilitate the extrapolation of emotional expressions
in deep neural network (DNN)-based text-to-speech (TTS). In this study, the meaning of …

A review of affective generation models

G Nie, Y Zhan - arXiv preprint arXiv:2202.10763, 2022 - arxiv.org
Affective computing is an emerging interdisciplinary field where computational systems are
developed to analyze, recognize, and influence the affective states of a human. It can …

Controllable Multi-Speaker Emotional Speech Synthesis With Emotion Representation of High Generalization Capability

J Zheng, J Zhou, W Zheng, L Tao… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
The aim of multi-speaker emotional speech synthesis is to generate speech for a designated
speaker in a desired emotional state. The task is challenging due to the presence of speech …