- Academic Search

Recent advances in speech language models: A survey

W Cui, D Yu, X Jiao, Z Meng, G Zhang, Q Wang… - arxiv preprint arxiv …, 2024 - arxiv.org

Large Language Models (LLMs) have recently garnered significant attention, primarily for
their capabilities in text-based interactions. However, natural human interaction often relies …

Cited by 5

Llama-omni: Seamless speech interaction with large language models

Q Fang, S Guo, Y Zhou, Z Ma, S Zhang… - arxiv preprint arxiv …, 2024 - arxiv.org

Models like GPT-4o enable real-time interaction with large language models (LLMs) through
speech, significantly enhancing user experience compared to traditional text-based …

Cited by 34

Emova: Empowering language models to see, hear and speak with vivid emotions

K Chen, Y Gou, R Huang, Z Liu, D Tan, J Xu… - arxiv preprint arxiv …, 2024 - arxiv.org

GPT-4o, an omni-modal model that enables vocal conversations with diverse emotions and
tones, marks a milestone for omni-modal foundation models. However, empowering Large …

Cited by 14

F5-TTS: A fairytaler that fakes fluent and faithful speech with flow matching

Y Chen, Z Niu, Z Ma, K Deng, C Wang, J Zhao… - arxiv preprint arxiv …, 2024 - arxiv.org

This paper introduces F5-TTS, a fully non-autoregressive text-to-speech system based on
flow matching with Diffusion Transformer (DiT). Without requiring complex designs such as …

Cited by 20

Maskgct: Zero-shot text-to-speech with masked generative codec transformer

Y Wang, H Zhan, L Liu, R Zeng, H Guo, J Zheng… - arxiv preprint arxiv …, 2024 - arxiv.org

The recent large-scale text-to-speech (TTS) systems are usually grouped as autoregressive
and non-autoregressive systems. The autoregressive systems implicitly model duration but …

Cited by 15

Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey

L Chen, Z Wang, S Ren, L Li, H Zhao, Y Li… - arxiv preprint arxiv …, 2024 - arxiv.org

Building on the foundations of language modeling in natural language processing, Next
Token Prediction (NTP) has evolved into a versatile training objective for machine learning …

Cited by 2

Mini-omni2: Towards open-source GPT-4o with vision, speech and duplex capabilities

Z Xie, C Wu - arxiv preprint arxiv:2410.11190, 2024 - arxiv.org

GPT-4o, an all-encompassing model, represents a milestone in the development of large
multi-modal language models. It can understand visual, auditory, and textual modalities …

Cited by 17

FireRedTTS: A foundation text-to-speech framework for industry-level generative speech applications

HH Guo, K Liu, FY Shen, YC Wu, FL Xie, K Xie… - arxiv preprint arxiv …, 2024 - arxiv.org

This work proposes FireRedTTS, a foundation text-to-speech framework, to meet the
growing demands for personalized and diverse generative speech applications. The …

Cited by 12

Wavchat: A survey of spoken dialogue models

S Ji, Y Chen, M Fang, J Zuo, J Lu, H Wang… - arxiv preprint arxiv …, 2024 - arxiv.org

Recent advancements in spoken dialogue models, exemplified by systems like GPT-4o,
have captured significant attention in the speech domain. Compared to traditional three-tier …

Cited by 7

CosyVoice 2: Scalable streaming speech synthesis with large language models

Z Du, Y Wang, Q Chen, X Shi, X Lv, T Zhao… - arxiv preprint arxiv …, 2024 - arxiv.org

In our previous work, we introduced CosyVoice, a multilingual speech synthesis model
based on supervised discrete speech tokens. By employing progressive semantic decoding …

Cited by 5
