Audiobox: Unified audio generation with natural language prompts
Audio is an essential part of our life, but creating it often requires expertise and is time-
consuming. Research communities have made great progress over the past year advancing …
WavChat: A survey of spoken dialogue models
Recent advancements in spoken dialogue models, exemplified by systems like GPT-4o,
have captured significant attention in the speech domain. Compared to traditional three-tier …
ControlSpeech: Towards simultaneous zero-shot speaker cloning and zero-shot language style control with decoupled codec
In this paper, we present ControlSpeech, a text-to-speech (TTS) system capable of fully
cloning the speaker's voice and enabling arbitrary control and adjustment of speaking style …
Mega-TTS 2: Boosting Prompting Mechanisms for Zero-Shot Speech Synthesis
Zero-shot text-to-speech (TTS) aims to synthesize voices with unseen speech prompts,
which significantly reduces the data and computation requirements for voice cloning by …
UniStyle: Unified style modeling for speaking style captioning and stylistic speech synthesis
Understanding the speaking style, such as the emotion of the interlocutor's speech, and
responding with speech in an appropriate style is a natural occurrence in human …
SpeechCraft: A fine-grained expressive speech dataset with natural language description
Speech-language multi-modal learning presents a significant challenge due to the fine
nuanced information inherent in speech styles. Therefore, a large-scale dataset providing …
Towards Controllable Speech Synthesis in the Era of Large Language Models: A Survey
T Xie, Y Rong, P Zhang, L Liu - arXiv preprint arXiv:2412.06602, 2024 - arxiv.org
Text-to-speech (TTS), also known as speech synthesis, is a prominent research area that
aims to generate natural-sounding human speech from text. Recently, with the increasing …
HierSpeech++: Bridging the gap between semantic and acoustic representation of speech by hierarchical variational inference for zero-shot speech synthesis
Large language models (LLM)-based speech synthesis has been widely adopted in zero-
shot speech synthesis. However, they require a large-scale data and possess the same …
VoxInstruct: Expressive human instruction-to-speech generation with unified multilingual codec language modelling
Recent AIGC systems possess the capability to generate digital multimedia content based
on human language instructions, such as text, image and video. However, when it comes to …
Natural language guidance of high-fidelity text-to-speech with synthetic annotations
D Lyth, S King - arXiv preprint arXiv:2402.01912, 2024 - arxiv.org
Text-to-speech models trained on large-scale datasets have demonstrated impressive in-
context learning capabilities and naturalness. However, control of speaker identity and style …