Google Scholar

L Meng, L Zhou, S Liu, S Chen, B Han, S Hu… - …

… large language models (LLMs) have brought tremendous intelligent
applications. Especially, the GPT-4o's excellent duplex speech interaction ability has …

Cited by 15, all 2 versions

[PDF] arxiv.org

CosyVoice 2: Scalable streaming speech synthesis with large language models

Z Du, Y Wang, Q Chen, X Shi, X Lv, T Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org

In our previous work, we introduced CosyVoice, a multilingual speech synthesis model
based on supervised discrete speech tokens. By employing progressive semantic decoding …

Cited by 7, all 4 versions

[PDF] arxiv.org

Scaling speech-text pre-training with synthetic interleaved data

A Zeng, Z Du, M Liu, L Zhang, S Jiang, Y Dong… - arXiv preprint arXiv …, 2024 - arxiv.org

Speech language models (SpeechLMs) accept speech input and produce speech output,
allowing for more natural human-computer interaction compared to text-based large …

Cited by 2, all 3 versions

[PDF] arxiv.org

Investigating neural audio codecs for speech language model-based speech generation

J Li, D Wang, X Wang, Y Qian, L Zhou… - 2024 IEEE Spoken …, 2024 - ieeexplore.ieee.org

Neural audio codec tokens serve as the fundamental building blocks for speech language
model (SLM)-based speech generation. However, there is no systematic understanding on …

Cited by 3, all 3 versions

[PDF] arxiv.org

SLAM-Omni: Timbre-Controllable Voice Interaction System with Single-Stage Training

W Chen, Z Ma, R Yan, Y Liang, X Li, R Xu, Z Niu… - arXiv preprint arXiv …, 2024 - arxiv.org

Recent advancements highlight the potential of end-to-end real-time spoken dialogue
systems, showcasing their low latency and high quality. In this paper, we introduce SLAM …

Cited by 4, all 2 versions

Saved to "My library"

VALL-E 2: Neural codec language models are human parity zero-shot text to speech synthesizers

Autoregressive speech synthesis without vector quantization

CosyVoice 2: Scalable streaming speech synthesis with large language models

Scaling speech-text pre-training with synthetic interleaved data

Investigating neural audio codecs for speech language model-based speech generation

SLAM-Omni: Timbre-Controllable Voice Interaction System with Single-Stage Training