Foundation models for music: A survey

Y Ma, A Øland, A Ragni, BMS Del Sette, C Saitis… - arXiv preprint arXiv …, 2024 - arxiv.org
In recent years, foundation models (FMs) such as large language models (LLMs) and latent
diffusion models (LDMs) have profoundly impacted diverse sectors, including music. This …

SALMONN: Towards generic hearing abilities for large language models

C Tang, W Yu, G Sun, X Chen, T Tan, W Li, L Lu… - arXiv preprint arXiv …, 2023 - arxiv.org
Hearing is arguably an essential ability of artificial intelligence (AI) agents in the physical
world, which refers to the perception and understanding of general auditory information …

AudioLDM 2: Learning holistic audio generation with self-supervised pretraining

H Liu, Y Yuan, X Liu, X Mei, Q Kong… - … on Audio, Speech …, 2024 - ieeexplore.ieee.org
Although audio generation shares commonalities across different types of audio, such as
speech, music, and sound effects, designing models for each type requires careful …

Multimodal pretraining, adaptation, and generation for recommendation: A survey

Q Liu, J Zhu, Y Yang, Q Dai, Z Du, XM Wu… - Proceedings of the 30th …, 2024 - dl.acm.org
Personalized recommendation serves as a ubiquitous channel for users to discover
information tailored to their interests. However, traditional recommendation models primarily …

MARBLE: Music audio representation benchmark for universal evaluation

R Yuan, Y Ma, Y Li, G Zhang, X Chen… - Advances in …, 2023 - proceedings.neurips.cc
In the era of extensive intersection between art and Artificial Intelligence (AI), such as image
generation and fiction co-creation, AI for music remains relatively nascent, particularly in …

M²UGen: Multi-modal Music Understanding and Generation with the Power of Large Language Models

S Liu, AS Hussain, C Sun, Y Shan - arXiv preprint arXiv:2311.11255, 2023 - arxiv.org
The current landscape of research leveraging large language models (LLMs) is
experiencing a surge. Many works harness the powerful reasoning capabilities of these …

Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey

L Chen, Z Wang, S Ren, L Li, H Zhao, Y Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Building on the foundations of language modeling in natural language processing, Next
Token Prediction (NTP) has evolved into a versatile training objective for machine learning …

Music understanding LLaMA: Advancing text-to-music generation with question answering and captioning

S Liu, AS Hussain, C Sun… - ICASSP 2024-2024 IEEE …, 2024 - ieeexplore.ieee.org
Text-to-music generation (T2M-Gen) faces a major obstacle due to the scarcity of large-scale
publicly available music datasets with natural language captions. To address this, we …

MuChoMusic: Evaluating music understanding in multimodal audio-language models

B Weck, I Manco, E Benetos, E Quinton… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal models that jointly process audio and language hold great promise in audio
understanding and are increasingly being adopted in the music domain. By allowing users …

Adapting Frechet Audio Distance for generative music evaluation

A Gui, H Gamper, S Braun… - ICASSP 2024-2024 …, 2024 - ieeexplore.ieee.org
The growing popularity of generative music models underlines the need for perceptually
relevant, objective music quality metrics. The Frechet Audio Distance (FAD) is commonly …