MuLan: A joint embedding of music audio and natural language

Q Huang, A Jansen, J Lee, R Ganti, JY Li… - arXiv preprint arXiv …, 2022 - arxiv.org
Music tagging and content-based retrieval systems have traditionally been constructed
using pre-defined ontologies covering a rigid set of music attributes or text queries. This …

Wav2CLIP: Learning robust audio representations from CLIP

HH Wu, P Seetharaman, K Kumar… - ICASSP 2022-2022 …, 2022 - ieeexplore.ieee.org
We propose Wav2CLIP, a robust audio representation learning method by distilling from
Contrastive Language-Image Pre-training (CLIP). We systematically evaluate Wav2CLIP on …

Foundation models for music: A survey

Y Ma, A Øland, A Ragni, BMS Del Sette, C Saitis… - arXiv preprint arXiv …, 2024 - arxiv.org
In recent years, foundation models (FMs) such as large language models (LLMs) and latent
diffusion models (LDMs) have profoundly impacted diverse sectors, including music. This …

MERT: Acoustic music understanding model with large-scale self-supervised training

Y Li, R Yuan, G Zhang, Y Ma, X Chen, H Yin… - arXiv preprint arXiv …, 2023 - arxiv.org
Self-supervised learning (SSL) has recently emerged as a promising paradigm for training
generalisable models on large-scale data in the fields of vision, text, and speech. Although …

Multimodal pretraining, adaptation, and generation for recommendation: A survey

Q Liu, J Zhu, Y Yang, Q Dai, Z Du, XM Wu… - Proceedings of the 30th …, 2024 - dl.acm.org
Personalized recommendation serves as a ubiquitous channel for users to discover
information tailored to their interests. However, traditional recommendation models primarily …

Codified audio language modeling learns useful representations for music information retrieval

R Castellon, C Donahue, P Liang - arXiv preprint arXiv:2107.05677, 2021 - arxiv.org
We demonstrate that language models pre-trained on codified (discretely-encoded) music
audio learn representations that are useful for downstream MIR tasks. Specifically, we …

Masked spectrogram modeling using masked autoencoders for learning general-purpose audio representation

D Niizumi, D Takeuchi, Y Ohishi… - … Evaluation of Audio …, 2022 - proceedings.mlr.press
Recent general-purpose audio representations show state-of-the-art performance on
various audio tasks. These representations are pre-trained by self-supervised learning …

Contrastive audio-language learning for music

I Manco, E Benetos, E Quinton, G Fazekas - arXiv preprint arXiv …, 2022 - arxiv.org
As one of the most intuitive interfaces known to humans, natural language has the potential
to mediate many tasks that involve human-computer interaction, especially in application …

Towards learning universal audio representations

L Wang, P Luc, Y Wu, A Recasens… - ICASSP 2022-2022 …, 2022 - ieeexplore.ieee.org
The ability to learn universal audio representations that can solve diverse speech, music,
and environment tasks can spur many applications that require general sound content …

MARBLE: Music audio representation benchmark for universal evaluation

R Yuan, Y Ma, Y Li, G Zhang, X Chen… - Advances in …, 2023 - proceedings.neurips.cc
In the era of extensive intersection between art and Artificial Intelligence (AI), such as image
generation and fiction co-creation, AI for music remains relatively nascent, particularly in …