Audiobox: Unified audio generation with natural language prompts

A Vyas, B Shi, M Le, A Tjandra, YC Wu, B Guo… - arXiv preprint arXiv …, 2023 - arxiv.org
Audio is an essential part of our life, but creating it often requires expertise and is time-
consuming. Research communities have made great progress over the past year advancing …

Wavchat: A survey of spoken dialogue models

S Ji, Y Chen, M Fang, J Zuo, J Lu, H Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent advancements in spoken dialogue models, exemplified by systems like GPT-4o,
have captured significant attention in the speech domain. Compared to traditional three-tier …

ControlSpeech: Towards simultaneous zero-shot speaker cloning and zero-shot language style control with decoupled codec

S Ji, J Zuo, W Wang, M Fang, S Zheng, Q Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
In this paper, we present ControlSpeech, a text-to-speech (TTS) system capable of fully
cloning the speaker's voice and enabling arbitrary control and adjustment of speaking style …

Mega-TTS 2: Boosting Prompting Mechanisms for Zero-Shot Speech Synthesis

Z Jiang, J Liu, Y Ren, J He, Z Ye, S Ji… - The Twelfth …, 2024 - openreview.net
Zero-shot text-to-speech (TTS) aims to synthesize voices with unseen speech prompts,
which significantly reduces the data and computation requirements for voice cloning by …

Unistyle: Unified style modeling for speaking style captioning and stylistic speech synthesis

X Zhu, W Tian, X Wang, L He, Y Xiao, X Wang… - Proceedings of the …, 2024 - dl.acm.org
Understanding the speaking style, such as the emotion of the interlocutor's speech, and
responding with speech in an appropriate style is a natural occurrence in human …

Speechcraft: A fine-grained expressive speech dataset with natural language description

Z Jin, J Jia, Q Wang, K Li, S Zhou, S Zhou… - Proceedings of the …, 2024 - dl.acm.org
Speech-language multi-modal learning presents a significant challenge due to the fine
nuanced information inherent in speech styles. Therefore, a large-scale dataset providing …

Towards Controllable Speech Synthesis in the Era of Large Language Models: A Survey

T Xie, Y Rong, P Zhang, L Liu - arXiv preprint arXiv:2412.06602, 2024 - arxiv.org
Text-to-speech (TTS), also known as speech synthesis, is a prominent research area that
aims to generate natural-sounding human speech from text. Recently, with the increasing …

Hierspeech++: Bridging the gap between semantic and acoustic representation of speech by hierarchical variational inference for zero-shot speech synthesis

SH Lee, HY Choi, SB Kim, SW Lee - arXiv preprint arXiv:2311.12454, 2023 - arxiv.org
Large language models (LLM)-based speech synthesis has been widely adopted in zero-
shot speech synthesis. However, they require a large-scale data and possess the same …

Voxinstruct: Expressive human instruction-to-speech generation with unified multilingual codec language modelling

Y Zhou, X Qin, Z Jin, S Zhou, S Lei, S Zhou… - Proceedings of the …, 2024 - dl.acm.org
Recent AIGC systems possess the capability to generate digital multimedia content based
on human language instructions, such as text, image and video. However, when it comes to …

Natural language guidance of high-fidelity text-to-speech with synthetic annotations

D Lyth, S King - arXiv preprint arXiv:2402.01912, 2024 - arxiv.org
Text-to-speech models trained on large-scale datasets have demonstrated impressive in-
context learning capabilities and naturalness. However, control of speaker identity and style …