- Academic Search

Naturalspeech 2: Latent diffusion models are natural and zero-shot speech and singing synthesizers

K Shen, Z Ju, X Tan, Y Liu, Y Leng, L He, T Qin… - arxiv preprint arxiv …, 2023 - arxiv.org

Scaling text-to-speech (TTS) to large-scale, multi-speaker, and in-the-wild datasets is
important to capture the diversity in human speech such as speaker identities, prosodies …

Cited by 216 (3 versions)

Uniaudio: An audio foundation model toward universal audio generation

D Yang, J Tian, X Tan, R Huang, S Liu, X Chang… - arxiv preprint arxiv …, 2023 - arxiv.org

Large Language models (LLM) have demonstrated the capability to handle a variety of
generative tasks. This paper presents the UniAudio system, which, unlike prior task-specific …

Cited by 105 (3 versions)

Speechx: Neural codec language model as a versatile speech transformer

X Wang, M Thakker, Z Chen, N Kanda… - … on Audio, Speech …, 2024 - ieeexplore.ieee.org

Recent advancements in generative speech models based on audio-text prompts have
enabled remarkable innovations like high-quality zero-shot text-to-speech. However …

Cited by 71 (2 versions)

Naturalspeech 3: Zero-shot speech synthesis with factorized codec and diffusion models

Z Ju, Y Wang, K Shen, X Tan, D Xin, D Yang… - arxiv preprint arxiv …, 2024 - arxiv.org

While recent large-scale text-to-speech (TTS) models have achieved significant progress,
they still fall short in speech quality, similarity, and prosody. Considering speech intricately …

Cited by 134 (4 versions)

Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey

L Chen, Z Wang, S Ren, L Li, H Zhao, Y Li… - arxiv preprint arxiv …, 2024 - arxiv.org

Building on the foundations of language modeling in natural language processing, Next
Token Prediction (NTP) has evolved into a versatile training objective for machine learning …

Cited by 2 (2 versions)

Language-codec: Reducing the gaps between discrete codec representation and speech language models

S Ji, M Fang, Z Jiang, S Zheng, Q Chen… - arxiv preprint arxiv …, 2024 - arxiv.org

In recent years, large language models have achieved significant success in generative
tasks (e.g., speech cloning and audio generation) related to speech, audio, music, and other …

Cited by 15 (2 versions)

Wavmark: Watermarking for audio generation

G Chen, Y Wu, S Liu, T Liu, X Du, F Wei - arxiv preprint arxiv:2308.12770, 2023 - arxiv.org

Recent breakthroughs in zero-shot voice synthesis have enabled imitating a speaker's voice
using just a few seconds of recording while maintaining a high level of realism. Alongside its …

Cited by 40 (2 versions)

Controlspeech: Towards simultaneous zero-shot speaker cloning and zero-shot language style control with decoupled codec

S Ji, J Zuo, W Wang, M Fang, S Zheng, Q Chen… - arxiv preprint arxiv …, 2024 - arxiv.org

In this paper, we present ControlSpeech, a text-to-speech (TTS) system capable of fully
cloning the speaker's voice and enabling arbitrary control and adjustment of speaking style …

Cited by 7 (2 versions)

Towards Controllable Speech Synthesis in the Era of Large Language Models: A Survey

T Xie, Y Rong, P Zhang, L Liu - arxiv preprint arxiv:2412.06602, 2024 - arxiv.org

Text-to-speech (TTS), also known as speech synthesis, is a prominent research area that
aims to generate natural-sounding human speech from text. Recently, with the increasing …

Cited by 1 (2 versions)

Simplespeech 2: Towards simple and efficient text-to-speech with flow-based scalar latent transformer diffusion models

D Yang, R Huang, Y Wang, H Guo, D Chong… - arxiv preprint arxiv …, 2024 - arxiv.org

Scaling Text-to-speech (TTS) to large-scale datasets has been demonstrated as an effective
method for improving the diversity and naturalness of synthesized speech. At the high level …

Cited by 4 (2 versions)

Saved to "My Library"

Make-a-voice: Unified voice synthesis with discrete representation

Naturalspeech 2: Latent diffusion models are natural and zero-shot speech and singing synthesizers

Uniaudio: An audio foundation model toward universal audio generation

Speechx: Neural codec language model as a versatile speech transformer

Naturalspeech 3: Zero-shot speech synthesis with factorized codec and diffusion models

Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey

Language-codec: Reducing the gaps between discrete codec representation and speech language models

Wavmark: Watermarking for audio generation

Controlspeech: Towards simultaneous zero-shot speaker cloning and zero-shot language style control with decoupled codec

Towards Controllable Speech Synthesis in the Era of Large Language Models: A Survey

Simplespeech 2: Towards simple and efficient text-to-speech with flow-based scalar latent transformer diffusion models