Foundation models for music: A survey
In recent years, foundation models (FMs) such as large language models (LLMs) and latent
diffusion models (LDMs) have profoundly impacted diverse sectors, including music. This …
SALMONN: Towards generic hearing abilities for large language models
Hearing is arguably an essential ability of artificial intelligence (AI) agents in the physical
world, which refers to the perception and understanding of general auditory information …
AudioLDM 2: Learning holistic audio generation with self-supervised pretraining
Although audio generation shares commonalities across different types of audio, such as
speech, music, and sound effects, designing models for each type requires careful …
Multimodal pretraining, adaptation, and generation for recommendation: A survey
Personalized recommendation serves as a ubiquitous channel for users to discover
information tailored to their interests. However, traditional recommendation models primarily …
MARBLE: Music audio representation benchmark for universal evaluation
In the era of extensive intersection between art and Artificial Intelligence (AI), such as image
generation and fiction co-creation, AI for music remains relatively nascent, particularly in …
MUGen: Multi-modal Music Understanding and Generation with the Power of Large Language Models
The current landscape of research leveraging large language models (LLMs) is
experiencing a surge. Many works harness the powerful reasoning capabilities of these …
Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey
Building on the foundations of language modeling in natural language processing, Next
Token Prediction (NTP) has evolved into a versatile training objective for machine learning …
Music understanding LLaMA: Advancing text-to-music generation with question answering and captioning
Text-to-music generation (T2M-Gen) faces a major obstacle due to the scarcity of large-scale
publicly available music datasets with natural language captions. To address this, we …
MuChoMusic: Evaluating music understanding in multimodal audio-language models
Multimodal models that jointly process audio and language hold great promise in audio
understanding and are increasingly being adopted in the music domain. By allowing users …
Adapting Fréchet audio distance for generative music evaluation
The growing popularity of generative music models underlines the need for perceptually
relevant, objective music quality metrics. The Fréchet Audio Distance (FAD) is commonly …