Moshi: a speech-text foundation model for real-time dialogue

A Défossez, L Mazaré, M Orsini, A Royer… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce Moshi, a speech-text foundation model and full-duplex spoken dialogue
framework. Current systems for spoken dialogue rely on pipelines of independent …

Recent advances in speech language models: A survey

W Cui, D Yu, X Jiao, Z Meng, G Zhang, Q Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have recently garnered significant attention, primarily for
their capabilities in text-based interactions. However, natural human interaction often relies …

Llama-omni: Seamless speech interaction with large language models

Q Fang, S Guo, Y Zhou, Z Ma, S Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Models like GPT-4o enable real-time interaction with large language models (LLMs) through
speech, significantly enhancing user experience compared to traditional text-based …

Wavchat: A survey of spoken dialogue models

S Ji, Y Chen, M Fang, J Zuo, J Lu, H Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent advancements in spoken dialogue models, exemplified by systems like GPT-4o,
have captured significant attention in the speech domain. Compared to traditional three-tier …

Style-talker: Finetuning audio language model and style-based text-to-speech model for fast spoken dialogue generation

YA Li, X Jiang, J Darefsky, G Zhu… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid advancement of large language models (LLMs) has significantly propelled the
development of text-based chatbots, demonstrating their capability to engage in coherent …

Analyzing and Mitigating Inconsistency in Discrete Audio Tokens for Neural Codec Language Models

W Liu, Z Guo, J Xu, Y Lv, Y Chu, Z Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
Building upon advancements in Large Language Models (LLMs), the field of audio
processing has seen increased interest in training audio generation tasks with discrete …

DFADD: The Diffusion and Flow-Matching Based Audio Deepfake Dataset

J Du, IM Lin, IH Chiu, X Chen, H Wu… - 2024 IEEE Spoken …, 2024 - ieeexplore.ieee.org
Mainstream zero-shot TTS production systems like Voicebox and Seed-TTS achieve human-parity
speech by leveraging flow-matching and diffusion models, respectively …

Body of Her: A Preliminary Study on End-to-End Humanoid Agent

T Ao - arXiv preprint arXiv:2408.02879, 2024 - arxiv.org
An interactive virtual humanoid agent is a crucial interface with the physical world. A relatively
complete humanoid agent first needs to have a face and body, then possess both verbal and …

Neural Codec Source Tracing: Toward Comprehensive Attribution in Open-Set Condition

Y **e, X Wang, Z Wang, R Fu, Z Wen, S Cao… - arxiv preprint arxiv …, 2025 - arxiv.org
Current research in audio deepfake detection is gradually transitioning from binary
classification to multi-class tasks, referred to as the audio deepfake source tracing task. However …

PSLM: Parallel Generation of Text and Speech with LLMs for Low-Latency Spoken Dialogue Systems

K Mitsui, K Mitsuda, T Wakatsuki, Y Hono… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal language models that process both text and speech have potential for
applications in spoken dialogue systems. However, current models face two major …