A review of deep learning techniques for speech processing

A Mehrish, N Majumder, R Bharadwaj, R Mihalcea… - Information …, 2023 - Elsevier
The field of speech processing has undergone a transformative shift with the advent of deep
learning. The use of multiple processing layers has enabled the creation of models capable …

VQTTS: High-fidelity text-to-speech synthesis with self-supervised VQ acoustic feature

C Du, Y Guo, X Chen, K Yu - arXiv preprint arXiv:2204.00768, 2022 - arxiv.org
The mainstream neural text-to-speech (TTS) pipeline is a cascade system, including an
acoustic model (AM) that predicts acoustic feature from the input transcript and a vocoder …

Controllable accented text-to-speech synthesis with fine and coarse-grained intensity rendering

R Liu, B Sisman, G Gao, H Li - IEEE/ACM Transactions on …, 2024 - ieeexplore.ieee.org
Accented text-to-speech (TTS) synthesis seeks to generate speech with an accent (L2) as a
variant of the standard version (L1), which is challenging as L2 is different from L1 in terms …

EmoDiff: Intensity controllable emotional text-to-speech with soft-label guidance

Y Guo, C Du, X Chen, K Yu - ICASSP 2023-2023 IEEE …, 2023 - ieeexplore.ieee.org
Although current neural text-to-speech (TTS) models are able to generate high-quality
speech, intensity controllable emotional TTS is still a challenging task. Most existing …

Autoregressive diffusion transformer for text-to-speech synthesis

Z Liu, S Wang, S Inoue, Q Bai, H Li - arXiv preprint arXiv:2406.05551, 2024 - arxiv.org
Audio language models have recently emerged as a promising approach for various audio
generation tasks, relying on audio tokenizers to encode waveforms into sequences of …

PromptTTS++: Controlling speaker identity in prompt-based text-to-speech using natural language descriptions

R Shimizu, R Yamamoto, M Kawamura… - ICASSP 2024-2024 …, 2024 - ieeexplore.ieee.org
We propose PromptTTS++, a prompt-based text-to-speech (TTS) synthesis system that
allows control over speaker identity using natural language descriptions. To control speaker …

DiffProsody: Diffusion-based latent prosody generation for expressive speech synthesis with prosody conditional adversarial training

HS Oh, SH Lee, SW Lee - IEEE/ACM Transactions on Audio …, 2024 - ieeexplore.ieee.org
Expressive text-to-speech systems have undergone significant advancements owing to
prosody modeling, but conventional methods can still be improved. Traditional approaches …

ControlSpeech: Towards simultaneous zero-shot speaker cloning and zero-shot language style control with decoupled codec

S Ji, J Zuo, W Wang, M Fang, S Zheng, Q Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
In this paper, we present ControlSpeech, a text-to-speech (TTS) system capable of fully
cloning the speaker's voice and enabling arbitrary control and adjustment of speaking style …

Speaker adaptive text-to-speech with timbre-normalized vector-quantized feature

C Du, Y Guo, X Chen, K Yu - IEEE/ACM Transactions on Audio …, 2023 - ieeexplore.ieee.org
Achieving high fidelity and speaker similarity in text-to-speech speaker adaptation with
limited amount of data is a challenging task. Most existing methods only consider adapting …

Acoustic modeling for end-to-end empathetic dialogue speech synthesis using linguistic and prosodic contexts of dialogue history

Y Nishimura, Y Saito, S Takamichi, K Tachibana… - arXiv preprint arXiv …, 2022 - arxiv.org
We propose an end-to-end empathetic dialogue speech synthesis (DSS) model that
considers both the linguistic and prosodic contexts of dialogue history. Empathy is the active …