- Academic Search

Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey

L Chen, Z Wang, S Ren, L Li, H Zhao, Y Li… - arXiv preprint arXiv …, 2024 - arxiv.org

Building on the foundations of language modeling in natural language processing, Next
Token Prediction (NTP) has evolved into a versatile training objective for machine learning …

Cited by 2

F5-TTS: A fairytaler that fakes fluent and faithful speech with flow matching

Y Chen, Z Niu, Z Ma, K Deng, C Wang, J Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org

This paper introduces F5-TTS, a fully non-autoregressive text-to-speech system based on
flow matching with Diffusion Transformer (DiT). Without requiring complex designs such as …

Cited by 16

Towards Controllable Speech Synthesis in the Era of Large Language Models: A Survey

T Xie, Y Rong, P Zhang, L Liu - arXiv preprint arXiv:2412.06602, 2024 - arxiv.org

Text-to-speech (TTS), also known as speech synthesis, is a prominent research area that
aims to generate natural-sounding human speech from text. Recently, with the increasing …

Cited by 1

Multimodal Latent Language Modeling with Next-Token Diffusion

Y Sun, H Bao, W Wang, Z Peng, L Dong… - arXiv preprint arXiv …, 2024 - arxiv.org

Multimodal generative models require a unified approach to handle both discrete data (e.g.,
text and code) and continuous data (e.g., image, audio, video). In this work, we propose …

Cited by 1

HALL-E: hierarchical neural codec language model for minute-long zero-shot text-to-speech synthesis

Y Nishimura, T Hirose, M Ohi, H Nakayama… - arXiv preprint arXiv …, 2024 - arxiv.org

Recently, text-to-speech (TTS) models based on large language models (LLMs) that
translate natural language text into sequences of discrete audio tokens have gained great …

Cited by 2

CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models

Z Du, Y Wang, Q Chen, X Shi, X Lv, T Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org

In our previous work, we introduced CosyVoice, a multilingual speech synthesis model
based on supervised discrete speech tokens. By employing progressive semantic decoding …

Cited by 1

SF-Speech: Straightened flow for zero-shot voice clone on small-scale dataset

X Li, Z Shang, H Hua, P Shi, C Yang, L Wang… - arXiv preprint arXiv …, 2024 - arxiv.org

Large-scale speech generation models have achieved impressive performance in the
zero-shot voice clone tasks relying on large-scale datasets. However, exploring how to achieve …

Cited by 2

JetFormer: An autoregressive generative model of raw images and text

M Tschannen, AS Pinto, A Kolesnikov - arXiv preprint arXiv:2411.19722, 2024 - arxiv.org

Removing modeling constraints and unifying architectures across domains has been a key
driver of the recent progress in training large multimodal models. However, most of these …

Cited by 1

StyleTTS-ZS: Efficient High-Quality Zero-Shot Text-to-Speech Synthesis with Distilled Time-Varying Style Diffusion

YA Li, X Jiang, C Han, N Mesgarani - arXiv preprint arXiv:2409.10058, 2024 - arxiv.org

The rapid development of large-scale text-to-speech (TTS) models has led to significant
advancements in modeling diverse speaker prosody and voices. However, these models …

Cited by 2

Audio-Agent: Leveraging LLMs For Audio Generation, Editing and Composition

Z Wang, YW Tai, CK Tang - arXiv preprint arXiv:2410.03335, 2024 - arxiv.org

We introduce Audio-Agent, a multimodal framework for audio generation, editing and
composition based on text or video inputs. Conventional approaches for text-to-audio (TTA) …

Cited by 1
