VoiceCraft: Zero-shot speech editing and text-to-speech in the wild

P Peng, PY Huang, SW Li, A Mohamed… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce VoiceCraft, a token infilling neural codec language model, that achieves state-
of-the-art performance on both speech editing and zero-shot text-to-speech (TTS) on …

FlashSpeech: Efficient zero-shot speech synthesis

Z Ye, Z Ju, H Liu, X Tan, J Chen, Y Lu, P Sun… - Proceedings of the …, 2024 - dl.acm.org
Recent progress in large-scale zero-shot speech synthesis has been significantly advanced
by language models and diffusion models. However, the generation process of both …

SongCreator: Lyrics-based universal song generation

S Lei, Y Zhou, B Tang, MWY Lam, F Liu, H Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
Music is an integral part of human culture, embodying human intelligence and creativity, of
which songs compose an essential part. While various aspects of song generation have …

Speech Editing--a Summary

T Kässmann, Y Liu, D Liu - arXiv preprint arXiv:2407.17172, 2024 - arxiv.org
With the rise of video production and social media, speech editing has become crucial for
creators to address issues like mispronunciations, missing words, or stuttering in audio …

E3 TTS: End-to-End Text-Based Speech Editing TTS System and Its Applications

Z Liang, Z Ma, C Du, K Yu… - IEEE/ACM Transactions on …, 2024 - ieeexplore.ieee.org
Text-based speech editing aims at manipulating part of real audio by modifying the
corresponding transcribed text, without being discernible by human auditory system. With …

FluentEditor+: Text-based Speech Editing by Modeling Local Hierarchical Acoustic Smoothness and Global Prosody Consistency

R Liu, J Xi, Z Jiang, H Li - arXiv preprint arXiv:2410.03719, 2024 - arxiv.org
Text-based speech editing (TSE) allows users to modify speech by editing the
corresponding text and performing operations such as cutting, copying, and pasting to …

SSR-Speech: Towards Stable, Safe and Robust Zero-shot Text-based Speech Editing and Synthesis

H Wang, M Yu, J Hai, C Chen, Y Hu, R Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
In this paper, we introduce SSR-Speech, a neural codec autoregressive model designed for
stable, safe, and robust zero-shot text-based speech editing and text-to-speech synthesis …

DiffEditor: Enhancing Speech Editing with Semantic Enrichment and Acoustic Consistency

Y Chen, Y Jia, S Zhao, Z Jiang, H Li, J Kang… - arXiv preprint arXiv …, 2024 - arxiv.org
As text-based speech editing becomes increasingly prevalent, the demand for unrestricted
free-text editing continues to grow. However, existing speech editing techniques encounter …

Autoregressive Diffusion Transformer for Text-to-Speech Synthesis

Z Liu, S Wang, S Inoue, Q Bai, H Li - arXiv preprint arXiv:2406.05551, 2024 - arxiv.org
Audio language models have recently emerged as a promising approach for various audio
generation tasks, relying on audio tokenizers to encode waveforms into sequences of …

MMSD-Net: Towards Multi-modal Stuttering Detection

L Nie, SR Kadiri, R Agrawal - arXiv preprint arXiv:2407.11492, 2024 - arxiv.org
Stuttering is a common speech impediment that is caused by irregular disruptions in speech
production, affecting over 70 million people across the world. Standard automatic speech …