Foundation models for music: A survey

Y Ma, A Øland, A Ragni, BMS Del Sette, C Saitis… - arxiv preprint arxiv …, 2024 - arxiv.org
In recent years, foundation models (FMs) such as large language models (LLMs) and latent
diffusion models (LDMs) have profoundly impacted diverse sectors, including music. This …

Tango 2: Aligning diffusion-based text-to-audio generations through direct preference optimization

N Majumder, CY Hung, D Ghosal, WN Hsu… - Proceedings of the …, 2024 - dl.acm.org
Generative multimodal content is increasingly prevalent in much of the content creation
arena, as it has the potential to allow artists and media personnel to create pre-production …

LauraGPT: Listen, attend, understand, and regenerate audio with GPT

Z Du, J Wang, Q Chen, Y Chu, Z Gao, Z Li, K Hu… - arxiv preprint arxiv …, 2023 - arxiv.org
Generative Pre-trained Transformer (GPT) models have achieved remarkable performance
on various natural language processing tasks, and have shown great potential as …

Autoregressive speech synthesis without vector quantization

L Meng, L Zhou, S Liu, S Chen, B Han, S Hu… - arxiv preprint arxiv …, 2024 - arxiv.org
We present MELLE, a novel continuous-valued tokens based language modeling approach
for text to speech synthesis (TTS). MELLE autoregressively generates continuous mel …

Audio-Language Datasets of Scenes and Events: A Survey

G Wijngaard, E Formisano, M Esposito… - IEEE …, 2025 - ieeexplore.ieee.org
Audio-language models (ALMs) generate linguistic descriptions of sound-producing events
and scenes. Advances in dataset creation and computational power have led to significant …

FlashSpeech: Efficient zero-shot speech synthesis

Z Ye, Z Ju, H Liu, X Tan, J Chen, Y Lu, P Sun… - Proceedings of the …, 2024 - dl.acm.org
Recent progress in large-scale zero-shot speech synthesis has been significantly advanced
by language models and diffusion models. However, the generation process of both …

PicoAudio: Enabling precise timestamp and frequency controllability of audio events in text-to-audio generation

Z Xie, X Xu, Z Wu, M Wu - arxiv preprint arxiv:2407.02869, 2024 - arxiv.org
Recently, audio generation tasks have attracted considerable research interests. Precise
temporal controllability is essential to integrate audio generation with real applications. In …

Recommendation with generative models

Y Deldjoo, Z He, J McAuley, A Korikov… - arxiv preprint arxiv …, 2024 - arxiv.org
Generative models are a class of AI models capable of creating new instances of data by
learning and sampling from their statistical distributions. In recent years, these models have …

ControlSpeech: Towards simultaneous zero-shot speaker cloning and zero-shot language style control with decoupled codec

S Ji, J Zuo, W Wang, M Fang, S Zheng, Q Chen… - arxiv preprint arxiv …, 2024 - arxiv.org
In this paper, we present ControlSpeech, a text-to-speech (TTS) system capable of fully
cloning the speaker's voice and enabling arbitrary control and adjustment of speaking style …

UniStyle: Unified style modeling for speaking style captioning and stylistic speech synthesis

X Zhu, W Tian, X Wang, L He, Y Xiao, X Wang… - Proceedings of the …, 2024 - dl.acm.org
Understanding the speaking style, such as the emotion of the interlocutor's speech, and
responding with speech in an appropriate style is a natural occurrence in human …