- Academic Search

Foundation models for music: A survey

Y Ma, A Øland, A Ragni, BMS Del Sette, C Saitis… - arXiv preprint arXiv …, 2024 - arxiv.org

In recent years, foundation models (FMs) such as large language models (LLMs) and latent
diffusion models (LDMs) have profoundly impacted diverse sectors, including music. This …

Cited by 12

Tango 2: Aligning diffusion-based text-to-audio generations through direct preference optimization

N Majumder, CY Hung, D Ghosal, WN Hsu… - Proceedings of the …, 2024 - dl.acm.org

Generative multimodal content is increasingly prevalent in much of the content creation
arena, as it has the potential to allow artists and media personnel to create pre-production …

Cited by 37

LauraGPT: Listen, attend, understand, and regenerate audio with GPT

Z Du, J Wang, Q Chen, Y Chu, Z Gao, Z Li, K Hu… - arXiv preprint arXiv …, 2023 - arxiv.org

Generative Pre-trained Transformer (GPT) models have achieved remarkable performance
on various natural language processing tasks, and have shown great potential as …

Cited by 44

Autoregressive speech synthesis without vector quantization

L Meng, L Zhou, S Liu, S Chen, B Han, S Hu… - arXiv preprint arXiv …, 2024 - arxiv.org

We present MELLE, a novel continuous-valued tokens based language modeling approach
for text to speech synthesis (TTS). MELLE autoregressively generates continuous mel …

Cited by 22

Audio-Language Datasets of Scenes and Events: A Survey

G Wijngaard, E Formisano, M Esposito… - IEEE …, 2025 - ieeexplore.ieee.org

Audio-language models (ALMs) generate linguistic descriptions of sound-producing events
and scenes. Advances in dataset creation and computational power have led to significant …

Cited by 2

FlashSpeech: Efficient zero-shot speech synthesis

Z Ye, Z Ju, H Liu, X Tan, J Chen, Y Lu, P Sun… - Proceedings of the …, 2024 - dl.acm.org

Recent progress in large-scale zero-shot speech synthesis has been significantly advanced
by language models and diffusion models. However, the generation process of both …

Cited by 12

PicoAudio: Enabling precise timestamp and frequency controllability of audio events in text-to-audio generation

Z Xie, X Xu, Z Wu, M Wu - arXiv preprint arXiv:2407.02869, 2024 - arxiv.org

Recently, audio generation tasks have attracted considerable research interests. Precise
temporal controllability is essential to integrate audio generation with real applications. In …

Cited by 11

Recommendation with generative models

Y Deldjoo, Z He, J McAuley, A Korikov… - arXiv preprint arXiv …, 2024 - arxiv.org

Generative models are a class of AI models capable of creating new instances of data by
learning and sampling from their statistical distributions. In recent years, these models have …

Cited by 8

ControlSpeech: Towards simultaneous zero-shot speaker cloning and zero-shot language style control with decoupled codec

S Ji, J Zuo, W Wang, M Fang, S Zheng, Q Chen… - arXiv preprint arXiv …, 2024 - arxiv.org

In this paper, we present ControlSpeech, a text-to-speech (TTS) system capable of fully
cloning the speaker's voice and enabling arbitrary control and adjustment of speaking style …

Cited by 7

UniStyle: Unified style modeling for speaking style captioning and stylistic speech synthesis

X Zhu, W Tian, X Wang, L He, Y Xiao, X Wang… - Proceedings of the …, 2024 - dl.acm.org

Understanding the speaking style, such as the emotion of the interlocutor's speech, and
responding with speech in an appropriate style is a natural occurrence in human …

Cited by 5

Audiobox: Unified audio generation with natural language prompts

Foundation models for music: A survey