- Academic Search

W Cui, D Yu, X Jiao, Z Meng, G Zhang, Q Wang… - arxiv preprint arxiv …, 2024 - arxiv.org

Large Language Models (LLMs) have recently garnered significant attention, primarily for
their capabilities in text-based interactions. However, natural human interaction often relies …

Gem Citer Citeret af 5 Relaterede artikler Alle 2 versioner Vis som HTML

[Free GPT-4]
[DeepSeek]

[PDF] arxiv.org

Llama-omni: Seamless speech interaction with large language models

Q Fang, S Guo, Y Zhou, Z Ma, S Zhang… - arxiv preprint arxiv …, 2024 - arxiv.org

Models like GPT-4o enable real-time interaction with large language models (LLMs) through
speech, significantly enhancing user experience compared to traditional text-based …

Gem Citer Citeret af 38 Relaterede artikler Alle 4 versioner Vis som HTML

[Free GPT-4]
[DeepSeek]

[PDF] arxiv.org

Emova: Empowering language models to see, hear and speak with vivid emotions

K Chen, Y Gou, R Huang, Z Liu, D Tan, J Xu… - arxiv preprint arxiv …, 2024 - arxiv.org

GPT-4o, an omni-modal model that enables vocal conversations with diverse emotions and
tones, marks a milestone for omni-modal foundation models. However, empowering Large …

Gem Citer Citeret af 15 Relaterede artikler Alle 3 versioner Vis som HTML

[Free GPT-4]
[DeepSeek]

[PDF] arxiv.org

F5-tts: A fairytaler that fakes fluent and faithful speech with flow matching

Y Chen, Z Niu, Z Ma, K Deng, C Wang, J Zhao… - arxiv preprint arxiv …, 2024 - arxiv.org

This paper introduces F5-TTS, a fully non-autoregressive text-to-speech system based on
flow matching with Diffusion Transformer (DiT). Without requiring complex designs such as …

Gem Citer Citeret af 21 Relaterede artikler Alle 2 versioner Vis som HTML

[Free GPT-4]
[DeepSeek]

[PDF] arxiv.org

Maskgct: Zero-shot text-to-speech with masked generative codec transformer

Y Wang, H Zhan, L Liu, R Zeng, H Guo, J Zheng… - arxiv preprint arxiv …, 2024 - arxiv.org

The recent large-scale text-to-speech (TTS) systems are usually grouped as autoregressive
and non-autoregressive systems. The autoregressive systems implicitly model duration but …

Gem Citer Citeret af 15 Relaterede artikler Alle 3 versioner Vis som HTML

[Free GPT-4]
[DeepSeek]

[PDF] arxiv.org

Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey

L Chen, Z Wang, S Ren, L Li, H Zhao, Y Li… - arxiv preprint arxiv …, 2024 - arxiv.org

Building on the foundations of language modeling in natural language processing, Next
Token Prediction (NTP) has evolved into a versatile training objective for machine learning …

Gem Citer Citeret af 3 Relaterede artikler Alle 2 versioner Vis som HTML

[Free GPT-4]
[DeepSeek]

[PDF] arxiv.org

Mini-omni2: Towards open-source gpt-4o with vision, speech and duplex capabilities

Z **e, C Wu - arxiv preprint arxiv:2410.11190, 2024 - arxiv.org

GPT-4o, an all-encompassing model, represents a milestone in the development of large
multi-modal language models. It can understand visual, auditory, and textual modalities …

Gem Citer Citeret af 18 Relaterede artikler Alle 2 versioner Vis som HTML

[Free GPT-4]
[DeepSeek]

[PDF] neurips.cc

Songcreator: Lyrics-based universal song generation

S Lei, Y Zhou, B Tang, MWY Lam… - Advances in …, 2025 - proceedings.neurips.cc

Music is an integral part of human culture, embodying human intelligence and creativity, of
which songs compose an essential part. While various aspects of song generation have …

Gem Citer Citeret af 3 Relaterede artikler Alle 5 versioner Vis som HTML

[Free GPT-4]
[DeepSeek]

[PDF] arxiv.org

Fireredtts: A foundation text-to-speech framework for industry-level generative speech applications

HH Guo, K Liu, FY Shen, YC Wu, FL **e, K **e… - arxiv preprint arxiv …, 2024 - arxiv.org

This work proposes FireRedTTS, a foundation text-to-speech framework, to meet the
growing demands for personalized and diverse generative speech applications. The …

Gem Citer Citeret af 12 Relaterede artikler Alle 2 versioner Vis som HTML

[Free GPT-4]
[DeepSeek]

[PDF] arxiv.org

Wavchat: A survey of spoken dialogue models

S Ji, Y Chen, M Fang, J Zuo, J Lu, H Wang… - arxiv preprint arxiv …, 2024 - arxiv.org

Recent advancements in spoken dialogue models, exemplified by systems like GPT-4o,
have captured significant attention in the speech domain. Compared to traditional three-tier …

Gem Citer Citeret af 8 Relaterede artikler Alle 2 versioner Vis som HTML

Opret underretning

Citer

Avanceret søgning

Gemt i Min samling

Cosyvoice: A scalable multilingual zero-shot text-to-speech synthesizer based on supervised...

Recent advances in speech language models: A survey

Llama-omni: Seamless speech interaction with large language models

Emova: Empowering language models to see, hear and speak with vivid emotions

F5-tts: A fairytaler that fakes fluent and faithful speech with flow matching

Maskgct: Zero-shot text-to-speech with masked generative codec transformer

Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey

Mini-omni2: Towards open-source gpt-4o with vision, speech and duplex capabilities

Songcreator: Lyrics-based universal song generation

Fireredtts: A foundation text-to-speech framework for industry-level generative speech applications

Wavchat: A survey of spoken dialogue models