VoiceCraft: Zero-shot speech editing and text-to-speech in the wild
We introduce VoiceCraft, a token infilling neural codec language model, that achieves state-
of-the-art performance on both speech editing and zero-shot text-to-speech (TTS) on …
FlashSpeech: Efficient zero-shot speech synthesis
Recent progress in large-scale zero-shot speech synthesis has been significantly advanced
by language models and diffusion models. However, the generation process of both …
SongCreator: Lyrics-based universal song generation
Music is an integral part of human culture, embodying human intelligence and creativity, of
which songs compose an essential part. While various aspects of song generation have …
Speech Editing--a Summary
T Kässmann, Y Liu, D Liu - arXiv preprint arXiv:2407.17172, 2024 - arxiv.org
With the rise of video production and social media, speech editing has become crucial for
creators to address issues like mispronunciations, missing words, or stuttering in audio …
E³TTS: End-to-End Text-Based Speech Editing TTS System and Its Applications
Text-based speech editing aims at manipulating part of real audio by modifying the
corresponding transcribed text, without the edit being discernible to the human auditory system. With …
FluentEditor+: Text-based Speech Editing by Modeling Local Hierarchical Acoustic Smoothness and Global Prosody Consistency
Text-based speech editing (TSE) allows users to modify speech by editing the
corresponding text and performing operations such as cutting, copying, and pasting to …
SSR-Speech: Towards Stable, Safe and Robust Zero-shot Text-based Speech Editing and Synthesis
In this paper, we introduce SSR-Speech, a neural codec autoregressive model designed for
stable, safe, and robust zero-shot text-based speech editing and text-to-speech synthesis …
DiffEditor: Enhancing Speech Editing with Semantic Enrichment and Acoustic Consistency
As text-based speech editing becomes increasingly prevalent, the demand for unrestricted
free-text editing continues to grow. However, existing speech editing techniques encounter …
Autoregressive Diffusion Transformer for Text-to-Speech Synthesis
Audio language models have recently emerged as a promising approach for various audio
generation tasks, relying on audio tokenizers to encode waveforms into sequences of …
MMSD-Net: Towards Multi-modal Stuttering Detection
Stuttering is a common speech impediment that is caused by irregular disruptions in speech
production, affecting over 70 million people across the world. Standard automatic speech …