Investigating neural audio codecs for speech language model-based speech generation
Neural audio codec tokens serve as the fundamental building blocks for speech language
model (SLM)-based speech generation. However, there is no systematic understanding on …
model (SLM)-based speech generation. However, there is no systematic understanding on …
Multimodal Latent Language Modeling with Next-Token Diffusion
Multimodal generative models require a unified approach to handle both discrete data (eg,
text and code) and continuous data (eg, image, audio, video). In this work, we propose …
text and code) and continuous data (eg, image, audio, video). In this work, we propose …
SLAM-Omni: Timbre-Controllable Voice Interaction System with Single-Stage Training
Recent advancements highlight the potential of end-to-end real-time spoken dialogue
systems, showcasing their low latency and high quality. In this paper, we introduce SLAM …
systems, showcasing their low latency and high quality. In this paper, we introduce SLAM …
Scaling speech-text pre-training with synthetic interleaved data
Speech language models (SpeechLMs) accept speech input and produce speech output,
allowing for more natural human-computer interaction compared to text-based large …
allowing for more natural human-computer interaction compared to text-based large …