Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey
Building on the foundations of language modeling in natural language processing, Next
Token Prediction (NTP) has evolved into a versatile training objective for machine learning …
F5-TTS: A fairytaler that fakes fluent and faithful speech with flow matching
This paper introduces F5-TTS, a fully non-autoregressive text-to-speech system based on
flow matching with Diffusion Transformer (DiT). Without requiring complex designs such as …
Towards Controllable Speech Synthesis in the Era of Large Language Models: A Survey
T Xie, Y Rong, P Zhang, L Liu - arxiv preprint arxiv:2412.06602, 2024 - arxiv.org
Text-to-speech (TTS), also known as speech synthesis, is a prominent research area that
aims to generate natural-sounding human speech from text. Recently, with the increasing …
Multimodal Latent Language Modeling with Next-Token Diffusion
Multimodal generative models require a unified approach to handle both discrete data (e.g.,
text and code) and continuous data (e.g., image, audio, video). In this work, we propose …
HALL-E: hierarchical neural codec language model for minute-long zero-shot text-to-speech synthesis
Recently, Text-to-speech (TTS) models based on large language models (LLMs) that
translate natural language text into sequences of discrete audio tokens have gained great …
CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models
In our previous work, we introduced CosyVoice, a multilingual speech synthesis model
based on supervised discrete speech tokens. By employing progressive semantic decoding …
SF-Speech: Straightened flow for zero-shot voice clone on small-scale dataset
X Li, Z Shang, H Hua, P Shi, C Yang, L Wang… - arxiv preprint arxiv …, 2024 - arxiv.org
Large-scale speech generation models have achieved impressive performance in the zero-
shot voice clone tasks relying on large-scale datasets. However, exploring how to achieve …
JetFormer: An autoregressive generative model of raw images and text
Removing modeling constraints and unifying architectures across domains has been a key
driver of the recent progress in training large multimodal models. However, most of these …
StyleTTS-ZS: Efficient High-Quality Zero-Shot Text-to-Speech Synthesis with Distilled Time-Varying Style Diffusion
The rapid development of large-scale text-to-speech (TTS) models has led to significant
advancements in modeling diverse speaker prosody and voices. However, these models …
Audio-Agent: Leveraging LLMs For Audio Generation, Editing and Composition
We introduce Audio-Agent, a multimodal framework for audio generation, editing and
composition based on text or video inputs. Conventional approaches for text-to-audio (TTA) …