FlashSpeech: Efficient zero-shot speech synthesis

Z Ye, Z Ju, H Liu, X Tan, J Chen, Y Lu, P Sun… - Proceedings of the …, 2024 - dl.acm.org
Recent progress in large-scale zero-shot speech synthesis has been significantly advanced
by language models and diffusion models. However, the generation process of both …

Autoregressive diffusion transformer for text-to-speech synthesis

Z Liu, S Wang, S Inoue, Q Bai, H Li - arXiv preprint arXiv:2406.05551, 2024 - arxiv.org
Audio language models have recently emerged as a promising approach for various audio
generation tasks, relying on audio tokenizers to encode waveforms into sequences of …

SongCreator: Lyrics-based universal song generation

S Lei, Y Zhou, B Tang, MWY Lam… - Advances in …, 2025 - proceedings.neurips.cc
Music is an integral part of human culture, embodying human intelligence and creativity, of
which songs compose an essential part. While various aspects of song generation have …

Speech Editing--a Summary

T Kässmann, Y Liu, D Liu - arXiv preprint arXiv:2407.17172, 2024 - arxiv.org
With the rise of video production and social media, speech editing has become crucial for
creators to address issues like mispronunciations, missing words, or stuttering in audio …

E3TTS: End-to-End Text-Based Speech Editing TTS System and Its Applications

Z Liang, Z Ma, C Du, K Yu… - IEEE/ACM Transactions on …, 2024 - ieeexplore.ieee.org
Text-based speech editing aims at manipulating part of real audio by modifying the
corresponding transcribed text, without being discernible by human auditory system. With …

FluentEditor: Text-based speech editing by considering acoustic and prosody consistency

R Liu, J Xi, Z Jiang, H Li - arXiv preprint arXiv:2309.11725, 2023 - arxiv.org
Text-based speech editing (TSE) techniques are designed to enable users to edit the output
audio by modifying the input text transcript instead of the audio itself. Despite much progress …

FluentEditor+: Text-based Speech Editing by Modeling Local Hierarchical Acoustic Smoothness and Global Prosody Consistency

R Liu, J Xi, Z Jiang, H Li - arXiv preprint arXiv:2410.03719, 2024 - arxiv.org
Text-based speech editing (TSE) allows users to modify speech by editing the
corresponding text and performing operations such as cutting, copying, and pasting to …

SSR-Speech: Towards Stable, Safe and Robust Zero-shot Text-based Speech Editing and Synthesis

H Wang, M Yu, J Hai, C Chen, Y Hu, R Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
In this paper, we introduce SSR-Speech, a neural codec autoregressive model designed for
stable, safe, and robust zero-shot text-based speech editing and text-to-speech synthesis …

Enhancing Expressive Voice Conversion with Discrete Pitch-Conditioned Flow Matching Model

J Zuo, S Ji, M Fang, Z Jiang, X Cheng, Q Yang… - arXiv preprint arXiv …, 2025 - arxiv.org
This paper introduces PFlow-VC, a conditional flow matching voice conversion model that
leverages fine-grained discrete pitch tokens and target speaker prompt information for …

DiffEditor: Enhancing Speech Editing with Semantic Enrichment and Acoustic Consistency

Y Chen, Y Jia, S Zhao, Z Jiang, H Li, J Kang… - arXiv preprint arXiv …, 2024 - arxiv.org
As text-based speech editing becomes increasingly prevalent, the demand for unrestricted
free-text editing continues to grow. However, existing speech editing techniques encounter …