NaturalSpeech 2: Latent diffusion models are natural and zero-shot speech and singing synthesizers

K Shen, Z Ju, X Tan, Y Liu, Y Leng, L He, T Qin… - arXiv preprint arXiv …, 2023 - arxiv.org
Scaling text-to-speech (TTS) to large-scale, multi-speaker, and in-the-wild datasets is
important to capture the diversity in human speech such as speaker identities, prosodies …

Foundation models for music: A survey

Y Ma, A Øland, A Ragni, BMS Del Sette, C Saitis… - arXiv preprint arXiv …, 2024 - arxiv.org
In recent years, foundation models (FMs) such as large language models (LLMs) and latent
diffusion models (LDMs) have profoundly impacted diverse sectors, including music. This …

Matcha-TTS: A fast TTS architecture with conditional flow matching

S Mehta, R Tu, J Beskow, É Székely… - ICASSP 2024-2024 …, 2024 - ieeexplore.ieee.org
We introduce Matcha-TTS, a new encoder-decoder architecture for speedy TTS acoustic
modelling, trained using optimal-transport conditional flow matching (OT-CFM). This yields …

VoiceFlow: Efficient text-to-speech with rectified flow matching

Y Guo, C Du, Z Ma, X Chen, K Yu - ICASSP 2024-2024 IEEE …, 2024 - ieeexplore.ieee.org
Although diffusion models in text-to-speech have become a popular choice due to their
strong generative ability, the intrinsic complexity of sampling from diffusion models harms …

FlashSpeech: Efficient zero-shot speech synthesis

Z Ye, Z Ju, H Liu, X Tan, J Chen, Y Lu, P Sun… - Proceedings of the …, 2024 - dl.acm.org
Recent progress in large-scale zero-shot speech synthesis has been significantly advanced
by language models and diffusion models. However, the generation process of both …

Schrodinger bridges beat diffusion models on text-to-speech synthesis

Z Chen, G He, K Zheng, X Tan, J Zhu - arXiv preprint arXiv:2312.03491, 2023 - arxiv.org
In text-to-speech (TTS) synthesis, diffusion models have achieved promising generation
quality. However, because of the pre-defined data-to-noise diffusion process, their prior …

Autoregressive diffusion transformer for text-to-speech synthesis

Z Liu, S Wang, S Inoue, Q Bai, H Li - arXiv preprint arXiv:2406.05551, 2024 - arxiv.org
Audio language models have recently emerged as a promising approach for various audio
generation tasks, relying on audio tokenizers to encode waveforms into sequences of …

AudioLCM: Text-to-audio generation with latent consistency models

H Liu, R Huang, Y Liu, H Cao, J Wang, X Cheng… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent advancements in Latent Diffusion Models (LDMs) have propelled them to the
forefront of various generative tasks. However, their iterative sampling process poses a …

ReFlow-TTS: A rectified flow model for high-fidelity text-to-speech

W Guan, Q Su, H Zhou, S Miao, X Xie… - ICASSP 2024-2024 …, 2024 - ieeexplore.ieee.org
The diffusion models including Denoising Diffusion Probabilistic Models (DDPM) and
score-based generative models have demonstrated excellent performance in speech synthesis …