Recent advances in speech language models: A survey

W Cui, D Yu, X Jiao, Z Meng, G Zhang, Q Wang… - arxiv preprint arxiv …, 2024 - arxiv.org
Large Language Models (LLMs) have recently garnered significant attention, primarily for
their capabilities in text-based interactions. However, natural human interaction often relies …

Llama-omni: Seamless speech interaction with large language models

Q Fang, S Guo, Y Zhou, Z Ma, S Zhang… - arxiv preprint arxiv …, 2024 - arxiv.org
Models like GPT-4o enable real-time interaction with large language models (LLMs) through
speech, significantly enhancing user experience compared to traditional text-based …

Emova: Empowering language models to see, hear and speak with vivid emotions

K Chen, Y Gou, R Huang, Z Liu, D Tan, J Xu… - arxiv preprint arxiv …, 2024 - arxiv.org
GPT-4o, an omni-modal model that enables vocal conversations with diverse emotions and
tones, marks a milestone for omni-modal foundation models. However, empowering Large …

F5-tts: A fairytaler that fakes fluent and faithful speech with flow matching

Y Chen, Z Niu, Z Ma, K Deng, C Wang, J Zhao… - arxiv preprint arxiv …, 2024 - arxiv.org
This paper introduces F5-TTS, a fully non-autoregressive text-to-speech system based on
flow matching with Diffusion Transformer (DiT). Without requiring complex designs such as …

Maskgct: Zero-shot text-to-speech with masked generative codec transformer

Y Wang, H Zhan, L Liu, R Zeng, H Guo, J Zheng… - arxiv preprint arxiv …, 2024 - arxiv.org
The recent large-scale text-to-speech (TTS) systems are usually grouped as autoregressive
and non-autoregressive systems. The autoregressive systems implicitly model duration but …

Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey

L Chen, Z Wang, S Ren, L Li, H Zhao, Y Li… - arxiv preprint arxiv …, 2024 - arxiv.org
Building on the foundations of language modeling in natural language processing, Next
Token Prediction (NTP) has evolved into a versatile training objective for machine learning …

Mini-omni2: Towards open-source gpt-4o with vision, speech and duplex capabilities

Z **e, C Wu - arxiv preprint arxiv:2410.11190, 2024 - arxiv.org
GPT-4o, an all-encompassing model, represents a milestone in the development of large
multi-modal language models. It can understand visual, auditory, and textual modalities …

Fireredtts: A foundation text-to-speech framework for industry-level generative speech applications

HH Guo, K Liu, FY Shen, YC Wu, FL **e, K **e… - arxiv preprint arxiv …, 2024 - arxiv.org
This work proposes FireRedTTS, a foundation text-to-speech framework, to meet the
growing demands for personalized and diverse generative speech applications. The …

Wavchat: A survey of spoken dialogue models

S Ji, Y Chen, M Fang, J Zuo, J Lu, H Wang… - arxiv preprint arxiv …, 2024 - arxiv.org
Recent advancements in spoken dialogue models, exemplified by systems like GPT-4o,
have captured significant attention in the speech domain. Compared to traditional three-tier …

Cosyvoice 2: Scalable streaming speech synthesis with large language models

Z Du, Y Wang, Q Chen, X Shi, X Lv, T Zhao… - arxiv preprint arxiv …, 2024 - arxiv.org
In our previous work, we introduced CosyVoice, a multilingual speech synthesis model
based on supervised discrete speech tokens. By employing progressive semantic decoding …