Autoregressive speech synthesis without vector quantization

L Meng, L Zhou, S Liu, S Chen, B Han, S Hu… - ar…
… large language models (LLMs) have brought tremendous intelligent
applications. Especially, the GPT-4o's excellent duplex speech interaction ability has …

Investigating neural audio codecs for speech language model-based speech generation

J Li, D Wang, X Wang, Y Qian, L Zhou… - 2024 IEEE Spoken …, 2024 - ieeexplore.ieee.org
Neural audio codec tokens serve as the fundamental building blocks for speech language
model (SLM)-based speech generation. However, there is no systematic understanding on …

Multimodal Latent Language Modeling with Next-Token Diffusion

Y Sun, H Bao, W Wang, Z Peng, L Dong… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal generative models require a unified approach to handle both discrete data (e.g.,
text and code) and continuous data (e.g., image, audio, video). In this work, we propose …

SLAM-Omni: Timbre-Controllable Voice Interaction System with Single-Stage Training

W Chen, Z Ma, R Yan, Y Liang, X Li, R Xu, Z Niu… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent advancements highlight the potential of end-to-end real-time spoken dialogue
systems, showcasing their low latency and high quality. In this paper, we introduce SLAM …

Scaling speech-text pre-training with synthetic interleaved data

A Zeng, Z Du, M Liu, L Zhang, S Jiang, Y Dong… - arXiv preprint arXiv …, 2024 - arxiv.org
Speech language models (SpeechLMs) accept speech input and produce speech output,
allowing for more natural human-computer interaction compared to text-based large …