Foundation models for music: A survey

Y Ma, A Øland, A Ragni, BMS Del Sette, C Saitis… - arXiv preprint arXiv …, 2024 - arxiv.org
In recent years, foundation models (FMs) such as large language models (LLMs) and latent
diffusion models (LDMs) have profoundly impacted diverse sectors, including music. This …

MERT: Acoustic music understanding model with large-scale self-supervised training

Y Li, R Yuan, G Zhang, Y Ma, X Chen, H Yin… - arXiv preprint arXiv …, 2023 - arxiv.org
Self-supervised learning (SSL) has recently emerged as a promising paradigm for training
generalisable models on large-scale data in the fields of vision, text, and speech. Although …

Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey

L Chen, Z Wang, S Ren, L Li, H Zhao, Y Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Building on the foundations of language modeling in natural language processing, Next
Token Prediction (NTP) has evolved into a versatile training objective for machine learning …

MARBLE: Music audio representation benchmark for universal evaluation

R Yuan, Y Ma, Y Li, G Zhang, X Chen… - Advances in …, 2024 - proceedings.neurips.cc
In the era of extensive intersection between art and Artificial Intelligence (AI), such as image
generation and fiction co-creation, AI for music remains relatively nascent, particularly in …

LyricWhiz: Robust multilingual zero-shot lyrics transcription by whispering to ChatGPT

L Zhuo, R Yuan, J Pan, Y Ma, Y Li, G Zhang… - arXiv preprint arXiv …, 2023 - arxiv.org
We introduce LyricWhiz, a robust, multilingual, and zero-shot automatic lyrics transcription
method achieving state-of-the-art performance on various lyrics transcription datasets, even …

On the effectiveness of speech self-supervised learning for music

Y Ma, R Yuan, Y Li, G Zhang, X Chen, H Yin… - arXiv preprint arXiv …, 2023 - arxiv.org
Self-supervised learning (SSL) has shown promising results in various speech and natural
language processing applications. However, its efficacy in music information retrieval (MIR) …

MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training

Y Li, R Yuan, G Zhang, Y Ma, X Chen… - The Twelfth …, 2023 - openreview.net
Self-supervised learning (SSL) has recently emerged as a promising paradigm for training
generalisable models on large-scale data in the fields of vision, text, and speech. Although …

Learning music representations with wav2vec 2.0

A Ragano, E Benetos, A Hines - 2023 31st Irish Conference on …, 2023 - ieeexplore.ieee.org
Learning music representations that are general-purpose offers the flexibility to finetune
several downstream tasks using smaller datasets. The wav2vec 2.0 speech representation …

MuQ: Self-Supervised Music Representation Learning with Mel Residual Vector Quantization

H Zhu, Y Zhou, H Chen, J Yu, Z Ma, R Gu… - arXiv preprint arXiv …, 2025 - arxiv.org
Recent years have witnessed the success of foundation models pre-trained with self-
supervised learning (SSL) in various music informatics understanding tasks, including music …

Unsupervised Musical Object Discovery from Audio

J Gha, V Herrmann, B Grewe, J Schmidhuber… - arXiv preprint arXiv …, 2023 - arxiv.org
Current object-centric learning models such as the popular SlotAttention architecture allow
for unsupervised visual scene decomposition. Our novel MusicSlots method adapts …