On the design fundamentals of diffusion models: A survey

Z Chang, GA Koulieris, HPH Shum - arxiv preprint arxiv:2306.04542, 2023 - arxiv.org
Diffusion models are generative models, which gradually add and remove noise to learn the
underlying distribution of training data for data generation. The components of diffusion …

Enhancing emotional text-to-speech controllability with natural language guidance through contrastive learning and diffusion models

X Jing, K Zhou, A Triantafyllopoulos… - arxiv preprint arxiv …, 2024 - arxiv.org
While current emotional text-to-speech (TTS) systems can generate highly intelligible
emotional speech, achieving fine control over emotion rendering of the output speech still …

SF-Speech: Straightened flow for zero-shot voice clone on small-scale dataset

X Li, Z Shang, H Hua, P Shi, C Yang, L Wang… - arxiv preprint arxiv …, 2024 - arxiv.org
Large-scale speech generation models have achieved impressive performance in the zero-shot
voice clone tasks relying on large-scale datasets. However, exploring how to achieve …

DEX-TTS: Diffusion-based expressive text-to-speech with style modeling on time variability

HJ Park, JS Kim, W Shin, SW Han - arxiv preprint arxiv:2406.19135, 2024 - arxiv.org
Expressive Text-to-Speech (TTS) using reference speech has been studied extensively to
synthesize natural speech, but there are limitations to obtaining well-represented styles and …

Learning-to-Cache: Accelerating Diffusion Transformer via Layer Caching

X Ma, G Fang, MB Mi, X Wang - arxiv preprint arxiv:2406.01733, 2024 - arxiv.org
Diffusion Transformers have recently demonstrated unprecedented generative capabilities
for various tasks. The encouraging results, however, come with the cost of slow inference …

Remix-DiT: Mixing Diffusion Transformers for Multi-Expert Denoising

G Fang, X Ma, X Wang - arxiv preprint arxiv:2412.05628, 2024 - arxiv.org
Transformer-based diffusion models have achieved significant advancements across a
variety of generative tasks. However, producing high-quality outputs typically necessitates …

DPI-TTS: Directional Patch Interaction for Fast-Converging and Style Temporal Modeling in Text-to-Speech

X Qi, R Fu, Z Wen, T Wang, C Qiang, J Tao, C Li… - arxiv preprint arxiv …, 2024 - arxiv.org
In recent years, speech diffusion models have advanced rapidly. Alongside the widely used
U-Net architecture, transformer-based models such as the Diffusion Transformer (DiT) have …

Personalized and Controllable Voice Style Transfer with Speech Diffusion Transformer

HY Choi, SH Lee, SW Lee - IEEE Transactions on Audio …, 2025 - ieeexplore.ieee.org
Although speech synthesis systems have remarkably advanced with their expansion into
various applications, achieving robust voice style transfer while maintaining high-quality in …