Learning Spatially-Aware Language and Audio Embedding

B Devnani, S Seto, Z Aldeneh, A Toso… - arXiv preprint arXiv …, 2024 - arxiv.org
Humans can picture a sound scene given an imprecise natural language description. For
example, it is easy to imagine an acoustic environment given a phrase like "the lion roar …

The Interpretation Gap in Text-to-Music Generation Models

Y Zang, Y Zhang - arXiv preprint arXiv:2407.10328, 2024 - arxiv.org
Large-scale text-to-music generation models have significantly enhanced music creation
capabilities, offering unprecedented creative freedom. However, their ability to collaborate …