Foundation models for music: A survey

Y Ma, A Øland, A Ragni, BMS Del Sette, C Saitis… - arXiv preprint arXiv …, 2024 - arxiv.org
In recent years, foundation models (FMs) such as large language models (LLMs) and latent
diffusion models (LDMs) have profoundly impacted diverse sectors, including music. This …

M²UGen: Multi-modal Music Understanding and Generation with the Power of Large Language Models

S Liu, AS Hussain, C Sun, Y Shan - arXiv preprint arXiv:2311.11255, 2023 - arxiv.org
The current landscape of research leveraging large language models (LLMs) is
experiencing a surge. Many works harness the powerful reasoning capabilities of these …

Zero-shot unsupervised and text-based audio editing using DDPM inversion

H Manor, T Michaeli - arXiv preprint arXiv:2402.10009, 2024 - arxiv.org
Editing signals using large pre-trained models, in a zero-shot manner, has recently seen
rapid advancements in the image domain. However, this wave has yet to reach the audio …

MusicMagus: Zero-shot text-to-music editing via diffusion models

Y Zhang, Y Ikemiya, G Xia, N Murata… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent advances in text-to-music generation models have opened new avenues in musical
creativity. However, music generation usually involves iterative refinements, and how to edit …

Loop Copilot: Conducting AI ensembles for music generation and iterative editing

Y Zhang, A Maezawa, G Xia, K Yamamoto… - arXiv preprint arXiv …, 2023 - arxiv.org
Creating music is iterative, requiring varied methods at each stage. However, existing AI
music systems fall short in orchestrating multiple subsystems for diverse needs. To address …

InstructSpeech: Following speech editing instructions via large language models

R Huang, R Hu, Y Wang, Z Wang, X Cheng… - … on Machine Learning, 2024 - openreview.net
Instruction-guided speech editing aims to follow the user's natural language instruction to
manipulate the semantic and acoustic attributes of a speech. In this work, we construct triplet …

COCOLA: Coherence-oriented contrastive learning of musical audio representations

R Ciranni, G Mariani, M Mancusi, E Postolache… - arXiv preprint arXiv …, 2024 - arxiv.org
We present COCOLA (Coherence-Oriented Contrastive Learning for Audio), a contrastive
learning method for musical audio representations that captures the harmonic and rhythmic …

Generalized multi-source inference for text conditioned music diffusion models

E Postolache, G Mariani, L Cosmo… - ICASSP 2024-2024 …, 2024 - ieeexplore.ieee.org
Multi-Source Diffusion Models (MSDM) allow for compositional musical generation tasks:
generating a set of coherent sources, creating accompaniments, and performing source …

Instruction-guided editing controls for images and multimedia: A survey in LLM era

TT Nguyen, Z Ren, T Pham, PL Nguyen, H Yin… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid advancement of large language models (LLMs) and multimodal learning has
transformed digital content creation and manipulation. Traditional visual editing tools require …

ST-ITO: Controlling audio effects for style transfer with inference-time optimization

CJ Steinmetz, S Singh, M Comunità, I Ibnyahya… - arXiv preprint arXiv …, 2024 - arxiv.org
Audio production style transfer is the task of processing an input to impart stylistic elements
from a reference recording. Existing approaches often train a neural network to estimate …