Autoregressive speech synthesis without vector quantization

L Meng, L Zhou, S Liu, S Chen, B Han, S Hu… - ar…
… large language models (LLMs) have brought tremendous intelligent
applications. Especially, the GPT-4o's excellent duplex speech interaction ability has …

CosyVoice 2: Scalable streaming speech synthesis with large language models

Z Du, Y Wang, Q Chen, X Shi, X Lv, T Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
In our previous work, we introduced CosyVoice, a multilingual speech synthesis model
based on supervised discrete speech tokens. By employing progressive semantic decoding …

Scaling speech-text pre-training with synthetic interleaved data

A Zeng, Z Du, M Liu, L Zhang, S Jiang, Y Dong… - arXiv preprint arXiv …, 2024 - arxiv.org
Speech language models (SpeechLMs) accept speech input and produce speech output,
allowing for more natural human-computer interaction compared to text-based large …

Investigating neural audio codecs for speech language model-based speech generation

J Li, D Wang, X Wang, Y Qian, L Zhou… - 2024 IEEE Spoken …, 2024 - ieeexplore.ieee.org
Neural audio codec tokens serve as the fundamental building blocks for speech language
model (SLM)-based speech generation. However, there is no systematic understanding on …

SLAM-Omni: Timbre-Controllable Voice Interaction System with Single-Stage Training

W Chen, Z Ma, R Yan, Y Liang, X Li, R Xu, Z Niu… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent advancements highlight the potential of end-to-end real-time spoken dialogue
systems, showcasing their low latency and high quality. In this paper, we introduce SLAM …