On the design fundamentals of diffusion models: A survey
Diffusion models are generative models, which gradually add and remove noise to learn the
underlying distribution of training data for data generation. The components of diffusion …
Enhancing emotional text-to-speech controllability with natural language guidance through contrastive learning and diffusion models
While current emotional text-to-speech (TTS) systems can generate highly intelligible
emotional speech, achieving fine control over emotion rendering of the output speech still …
SF-Speech: Straightened flow for zero-shot voice clone on small-scale dataset
X Li, Z Shang, H Hua, P Shi, C Yang, L Wang… - arxiv preprint arxiv …, 2024 - arxiv.org
Large-scale speech generation models have achieved impressive performance in the zero-
shot voice clone tasks relying on large-scale datasets. However, exploring how to achieve …
DEX-TTS: Diffusion-based expressive text-to-speech with style modeling on time variability
Expressive Text-to-Speech (TTS) using reference speech has been studied extensively to
synthesize natural speech, but there are limitations to obtaining well-represented styles and …
Learning-to-Cache: Accelerating Diffusion Transformer via Layer Caching
Diffusion Transformers have recently demonstrated unprecedented generative capabilities
for various tasks. The encouraging results, however, come with the cost of slow inference …
Remix-DiT: Mixing Diffusion Transformers for Multi-Expert Denoising
Transformer-based diffusion models have achieved significant advancements across a
variety of generative tasks. However, producing high-quality outputs typically necessitates …
DPI-TTS: Directional Patch Interaction for Fast-Converging and Style Temporal Modeling in Text-to-Speech
In recent years, speech diffusion models have advanced rapidly. Alongside the widely used
U-Net architecture, transformer-based models such as the Diffusion Transformer (DiT) have …
Personalized and Controllable Voice Style Transfer with Speech Diffusion Transformer
HY Choi, SH Lee, SW Lee - IEEE Transactions on Audio …, 2025 - ieeexplore.ieee.org
Although speech synthesis systems have remarkably advanced with their expansion into
various applications, achieving robust voice style transfer while maintaining high-quality in …