Naturalspeech 2: Latent diffusion models are natural and zero-shot speech and singing synthesizers
Scaling text-to-speech (TTS) to large-scale, multi-speaker, and in-the-wild datasets is
important to capture the diversity in human speech such as speaker identities, prosodies …
important to capture the diversity in human speech such as speaker identities, prosodies …
Uniaudio: An audio foundation model toward universal audio generation
Large Language models (LLM) have demonstrated the capability to handle a variety of
generative tasks. This paper presents the UniAudio system, which, unlike prior task-specific …
generative tasks. This paper presents the UniAudio system, which, unlike prior task-specific …
Speechx: Neural codec language model as a versatile speech transformer
Recent advancements in generative speech models based on audio-text prompts have
enabled remarkable innovations like high-quality zero-shot text-to-speech. However …
enabled remarkable innovations like high-quality zero-shot text-to-speech. However …
Naturalspeech 3: Zero-shot speech synthesis with factorized codec and diffusion models
While recent large-scale text-to-speech (TTS) models have achieved significant progress,
they still fall short in speech quality, similarity, and prosody. Considering speech intricately …
they still fall short in speech quality, similarity, and prosody. Considering speech intricately …
Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey
Building on the foundations of language modeling in natural language processing, Next
Token Prediction (NTP) has evolved into a versatile training objective for machine learning …
Token Prediction (NTP) has evolved into a versatile training objective for machine learning …
Language-codec: Reducing the gaps between discrete codec representation and speech language models
In recent years, large language models have achieved significant success in generative
tasks (eg, speech cloning and audio generation) related to speech, audio, music, and other …
tasks (eg, speech cloning and audio generation) related to speech, audio, music, and other …
Wavmark: Watermarking for audio generation
Recent breakthroughs in zero-shot voice synthesis have enabled imitating a speaker's voice
using just a few seconds of recording while maintaining a high level of realism. Alongside its …
using just a few seconds of recording while maintaining a high level of realism. Alongside its …
Controlspeech: Towards simultaneous zero-shot speaker cloning and zero-shot language style control with decoupled codec
In this paper, we present ControlSpeech, a text-to-speech (TTS) system capable of fully
cloning the speaker's voice and enabling arbitrary control and adjustment of speaking style …
cloning the speaker's voice and enabling arbitrary control and adjustment of speaking style …
Towards Controllable Speech Synthesis in the Era of Large Language Models: A Survey
T **e, Y Rong, P Zhang, L Liu - arxiv preprint arxiv:2412.06602, 2024 - arxiv.org
Text-to-speech (TTS), also known as speech synthesis, is a prominent research area that
aims to generate natural-sounding human speech from text. Recently, with the increasing …
aims to generate natural-sounding human speech from text. Recently, with the increasing …
Simplespeech 2: Towards simple and efficient text-to-speech with flow-based scalar latent transformer diffusion models
Scaling Text-to-speech (TTS) to large-scale datasets has been demonstrated as an effective
method for improving the diversity and naturalness of synthesized speech. At the high level …
method for improving the diversity and naturalness of synthesized speech. At the high level …