Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey

L Chen, Z Wang, S Ren, L Li, H Zhao, Y Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Building on the foundations of language modeling in natural language processing, Next
Token Prediction (NTP) has evolved into a versatile training objective for machine learning …

F5-TTS: A fairytaler that fakes fluent and faithful speech with flow matching

Y Chen, Z Niu, Z Ma, K Deng, C Wang, J Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
This paper introduces F5-TTS, a fully non-autoregressive text-to-speech system based on
flow matching with Diffusion Transformer (DiT). Without requiring complex designs such as …

Towards Controllable Speech Synthesis in the Era of Large Language Models: A Survey

T Xie, Y Rong, P Zhang, L Liu - arXiv preprint arXiv:2412.06602, 2024 - arxiv.org
Text-to-speech (TTS), also known as speech synthesis, is a prominent research area that
aims to generate natural-sounding human speech from text. Recently, with the increasing …

Multimodal Latent Language Modeling with Next-Token Diffusion

Y Sun, H Bao, W Wang, Z Peng, L Dong… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal generative models require a unified approach to handle both discrete data (e.g.,
text and code) and continuous data (e.g., image, audio, video). In this work, we propose …

HALL-E: hierarchical neural codec language model for minute-long zero-shot text-to-speech synthesis

Y Nishimura, T Hirose, M Ohi, H Nakayama… - arXiv preprint arXiv …, 2024 - arxiv.org
Recently, Text-to-speech (TTS) models based on large language models (LLMs) that
translate natural language text into sequences of discrete audio tokens have gained great …

CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models

Z Du, Y Wang, Q Chen, X Shi, X Lv, T Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
In our previous work, we introduced CosyVoice, a multilingual speech synthesis model
based on supervised discrete speech tokens. By employing progressive semantic decoding …

SF-Speech: Straightened flow for zero-shot voice clone on small-scale dataset

X Li, Z Shang, H Hua, P Shi, C Yang, L Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Large-scale speech generation models have achieved impressive performance in the zero-
shot voice clone tasks relying on large-scale datasets. However, exploring how to achieve …

JetFormer: An autoregressive generative model of raw images and text

M Tschannen, AS Pinto, A Kolesnikov - arXiv preprint arXiv:2411.19722, 2024 - arxiv.org
Removing modeling constraints and unifying architectures across domains has been a key
driver of the recent progress in training large multimodal models. However, most of these …

StyleTTS-ZS: Efficient High-Quality Zero-Shot Text-to-Speech Synthesis with Distilled Time-Varying Style Diffusion

YA Li, X Jiang, C Han, N Mesgarani - arXiv preprint arXiv:2409.10058, 2024 - arxiv.org
The rapid development of large-scale text-to-speech (TTS) models has led to significant
advancements in modeling diverse speaker prosody and voices. However, these models …

Audio-Agent: Leveraging LLMs For Audio Generation, Editing and Composition

Z Wang, YW Tai, CK Tang - arXiv preprint arXiv:2410.03335, 2024 - arxiv.org
We introduce Audio-Agent, a multimodal framework for audio generation, editing and
composition based on text or video inputs. Conventional approaches for text-to-audio (TTA) …