- Academic Search

L Meng, L Zhou, S Liu, S Chen, B Han, S Hu… - ar…

… large language models (LLMs) have brought tremendous intelligent
applications. Especially, the GPT-4o's excellent duplex speech interaction ability has …

Cited by 10 · All 2 versions

Investigating neural audio codecs for speech language model-based speech generation

J Li, D Wang, X Wang, Y Qian, L Zhou… - 2024 IEEE Spoken …, 2024 - ieeexplore.ieee.org

Neural audio codec tokens serve as the fundamental building blocks for speech language
model (SLM)-based speech generation. However, there is no systematic understanding on …

Cited by 3 · All 3 versions

Multimodal Latent Language Modeling with Next-Token Diffusion

Y Sun, H Bao, W Wang, Z Peng, L Dong… - arXiv preprint arXiv …, 2024 - arxiv.org

Multimodal generative models require a unified approach to handle both discrete data (e.g.,
text and code) and continuous data (e.g., image, audio, video). In this work, we propose …

Cited by 1 · All 2 versions

SLAM-Omni: Timbre-Controllable Voice Interaction System with Single-Stage Training

W Chen, Z Ma, R Yan, Y Liang, X Li, R Xu, Z Niu… - arXiv preprint arXiv …, 2024 - arxiv.org

Recent advancements highlight the potential of end-to-end real-time spoken dialogue
systems, showcasing their low latency and high quality. In this paper, we introduce SLAM …

Cited by 2 · All 2 versions

Scaling speech-text pre-training with synthetic interleaved data

A Zeng, Z Du, M Liu, L Zhang, S Jiang, Y Dong… - arXiv preprint arXiv …, 2024 - arxiv.org

Speech language models (SpeechLMs) accept speech input and produce speech output,
allowing for more natural human-computer interaction compared to text-based large …

Cited by 1 · All 3 versions

Saved to "My library"

VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers

Autoregressive speech synthesis without vector quantization

Investigating neural audio codecs for speech language model-based speech generation

Multimodal Latent Language Modeling with Next-Token Diffusion

SLAM-Omni: Timbre-Controllable Voice Interaction System with Single-Stage Training

Scaling speech-text pre-training with synthetic interleaved data