WavLLM: Towards robust and adaptive speech large language model

S Hu, L Zhou, S Liu, S Chen, L Meng, H Hao… - arXiv preprint arXiv …, 2024 - arxiv.org
The recent advancements in large language models (LLMs) have revolutionized the field of
natural language processing, progressively broadening their scope to multimodal …

Autoregressive speech synthesis without vector quantization

L Meng, L Zhou, S Liu, S Chen, B Han, S Hu… - arXiv preprint arXiv …, 2024 - arxiv.org
We present MELLE, a novel continuous-valued token-based language modeling approach
for text-to-speech synthesis (TTS). MELLE autoregressively generates continuous mel …
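As a rough illustration of the continuous-valued approach sketched in this snippet, the toy decoder below regresses the next mel-spectrogram frame directly with an MSE loss instead of predicting discrete codec indices with a softmax. The module names, dimensions, and architecture are illustrative assumptions, not MELLE's actual design.

```python
# Minimal sketch of autoregressive continuous-valued token modeling for TTS:
# the decoder regresses the next mel frame rather than classifying a codebook index.
# All names and sizes are illustrative assumptions, not MELLE's actual design.
import torch
import torch.nn as nn

class ContinuousARDecoder(nn.Module):
    def __init__(self, n_mels=80, d_model=256, n_layers=4, n_heads=4):
        super().__init__()
        self.in_proj = nn.Linear(n_mels, d_model)   # embed the previous mel frame
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, n_layers)
        self.out_proj = nn.Linear(d_model, n_mels)  # regression head, no codebook

    def forward(self, mel_prev):                    # (B, T, n_mels)
        T = mel_prev.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.decoder(self.in_proj(mel_prev), mask=causal)
        return self.out_proj(h)                     # predicted next frames

# Teacher-forced training step: regress frame t+1 from frames <= t.
model = ContinuousARDecoder()
mel = torch.randn(2, 100, 80)                       # dummy mel spectrograms
pred = model(mel[:, :-1])                           # inputs: frames 0..T-2
loss = nn.functional.mse_loss(pred, mel[:, 1:])     # targets: frames 1..T-1
loss.backward()
```

Replacing the codebook and cross-entropy loss with a regression head is the essential difference from discrete-token TTS language models; the causal backbone itself stays standard.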

E2 TTS: Embarrassingly easy fully non-autoregressive zero-shot TTS

SE Eskimez, X Wang, M Thakker, C Li… - 2024 IEEE Spoken …, 2024 - ieeexplore.ieee.org
This paper introduces Embarrassingly Easy Text-to-Speech (E2 TTS), a fully non-autoregressive
zero-shot text-to-speech system that offers human-level naturalness and …

Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey

L Chen, Z Wang, S Ren, L Li, H Zhao, Y Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Building on the foundations of language modeling in natural language processing, Next
Token Prediction (NTP) has evolved into a versatile training objective for machine learning …
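For readers unfamiliar with the objective this survey generalizes, the snippet below shows vanilla next-token prediction: shift the targets by one position and minimize cross-entropy. The toy embedding-plus-linear model is a placeholder assumption; a real NTP model would use a causal Transformer in its place.

```python
# Minimal sketch of the next-token prediction (NTP) objective: maximize the
# likelihood of token t+1 given tokens <= t. Vocabulary size and the toy model
# are assumptions; a real NTP model would be a causal Transformer.
import torch
import torch.nn as nn

vocab, d_model = 1000, 64
model = nn.Sequential(nn.Embedding(vocab, d_model), nn.Linear(d_model, vocab))

tokens = torch.randint(0, vocab, (2, 32))        # (batch, sequence) of token ids
logits = model(tokens[:, :-1])                   # a distribution per position
loss = nn.functional.cross_entropy(              # targets shifted by one position
    logits.reshape(-1, vocab), tokens[:, 1:].reshape(-1)
)
loss.backward()
```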

F5-TTS: A fairytaler that fakes fluent and faithful speech with flow matching

Y Chen, Z Niu, Z Ma, K Deng, C Wang, J Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
This paper introduces F5-TTS, a fully non-autoregressive text-to-speech system based on
flow matching with Diffusion Transformer (DiT). Without requiring complex designs such as …
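Both E2 TTS and F5-TTS train a non-autoregressive model with a flow-matching objective. The sketch below shows a generic form of that loss on mel frames: sample a point on the straight path between noise and data and regress the model output onto the path's velocity. The tiny MLP stands in for the Diffusion Transformer, and the shapes and conditioning details are assumptions rather than either paper's exact recipe.

```python
# Minimal sketch of a conditional flow-matching objective of the kind used by
# fully non-autoregressive TTS systems. The MLP is a stand-in for the Diffusion
# Transformer; shapes and conditioning are illustrative assumptions.
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    def __init__(self, n_mels=80, d_cond=80, d_hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_mels + d_cond + 1, d_hidden), nn.SiLU(),
            nn.Linear(d_hidden, n_mels),
        )

    def forward(self, x_t, t, cond):
        # Predict the velocity field v(x_t, t, cond) frame by frame.
        t = t[:, None, None].expand(-1, x_t.size(1), 1)
        return self.net(torch.cat([x_t, cond, t], dim=-1))

def flow_matching_loss(model, x1, cond):
    """x1: clean mel frames (B, T, n_mels); cond: conditioning features."""
    x0 = torch.randn_like(x1)                 # noise endpoint of the path
    t = torch.rand(x1.size(0))                # random time in [0, 1]
    x_t = (1 - t[:, None, None]) * x0 + t[:, None, None] * x1
    target_v = x1 - x0                        # velocity of the straight path
    return nn.functional.mse_loss(model(x_t, t, cond), target_v)

model = VelocityNet()
loss = flow_matching_loss(model, torch.randn(2, 50, 80), torch.randn(2, 50, 80))
loss.backward()
```

At inference, speech would be synthesized by integrating the learned velocity field from noise with an ODE solver; that step is omitted here.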

UniStyle: Unified style modeling for speaking style captioning and stylistic speech synthesis

X Zhu, W Tian, X Wang, L He, Y Xiao, X Wang… - Proceedings of the …, 2024 - dl.acm.org
Understanding the speaking style, such as the emotion of the interlocutor's speech, and
responding with speech in an appropriate style is a natural occurrence in human …

SALMONN-omni: A Codec-free LLM for Full-duplex Speech Understanding and Generation

W Yu, S Wang, X Yang, X Chen, X Tian… - arXiv preprint arXiv …, 2024 - arxiv.org
Full-duplex multimodal large language models (LLMs) provide a unified framework for
addressing diverse speech understanding and generation tasks, enabling more natural and …

Enhancing automatic speech recognition with personalized models: Improving accuracy through individualized fine-tuning

V Brydinskyi, D Sabodashko, Y Khoma… - IEEE …, 2024 - ieeexplore.ieee.org
Automatic speech recognition (ASR) systems have become increasingly popular in recent
years due to their ability to convert spoken language into text. Nonetheless, despite their …

BASE TTS: Lessons from building a billion-parameter text-to-speech model on 100K hours of data

M Łajszczak, G Cámbara, Y Li, F Beyhan… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce a text-to-speech (TTS) model called BASE TTS, which stands for $\textbf{B}$ig
$\textbf{A}$daptive $\textbf{S}$treamable TTS with $\textbf{E}$mergent abilities …

Multimodal Latent Language Modeling with Next-Token Diffusion

Y Sun, H Bao, W Wang, Z Peng, L Dong… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal generative models require a unified approach to handle both discrete data (e.g.,
text and code) and continuous data (e.g., image, audio, video). In this work, we propose …
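As a heavily simplified sketch of the next-token diffusion idea, the code below lets an autoregressive backbone emit a hidden state per position and trains a small denoising head to recover the noise added to the next continuous latent, conditioned on that state. The GRU backbone, the linear noising schedule, and all dimensions are illustrative assumptions, not the paper's actual parameterization.

```python
# Heavily simplified sketch of next-token diffusion over continuous latents:
# an AR backbone conditions a small per-token denoising head on its hidden state.
# Backbone choice, noise schedule, and sizes are illustrative assumptions.
import torch
import torch.nn as nn

d_latent, d_model = 16, 64
backbone = nn.GRU(d_latent, d_model, batch_first=True)          # stand-in AR backbone
denoiser = nn.Sequential(nn.Linear(d_latent + d_model + 1, 128),
                         nn.SiLU(), nn.Linear(128, d_latent))   # per-token diffusion head

latents = torch.randn(2, 20, d_latent)                          # continuous "tokens"
h, _ = backbone(latents[:, :-1])                                # states for positions 0..T-2
target = latents[:, 1:]                                         # next latent at each position

t = torch.rand(2, 19, 1)                                        # per-token noise level in [0, 1]
noise = torch.randn_like(target)
noisy = (1 - t) * target + t * noise                            # simple linear noising schedule
pred_noise = denoiser(torch.cat([noisy, h, t], dim=-1))
loss = nn.functional.mse_loss(pred_noise, noise)                # denoising objective per token
loss.backward()
```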