Deep audio-visual learning: A survey
Audio-visual learning, aimed at exploiting the relationship between audio and visual
modalities, has drawn considerable attention since deep learning started to be used …
Foundation models for music: A survey
In recent years, foundation models (FMs) such as large language models (LLMs) and latent
diffusion models (LDMs) have profoundly impacted diverse sectors, including music. This …
Auto-regressive image synthesis with integrated quantization
Deep generative models have achieved conspicuous progress in realistic image synthesis
with multifarious conditional inputs, while generating diverse yet high-fidelity images …
Music gesture for visual sound separation
Recent deep learning approaches have achieved impressive performance on visual sound
separation tasks. However, these approaches are mostly built on appearance and optical …
The sound of motions
Sounds originate from object motions and vibrations of surrounding air. Inspired by the fact
that humans are capable of interpreting sound sources from how objects move visually, we …
Foley music: Learning to generate music from videos
In this paper, we introduce Foley Music, a system that can synthesize plausible music for a
silent video clip about people playing musical instruments. We first identify two key …
MT3: Multi-task multitrack music transcription
Automatic Music Transcription (AMT), inferring musical notes from raw audio, is a
challenging task at the core of music understanding. Unlike Automatic Speech Recognition …
Taming visually guided sound generation
Recent advances in visually-induced audio generation are based on sampling short, low-
fidelity, and one-class sounds. Moreover, sampling 1 second of audio from the state-of-the …
Multi-instrument music synthesis with spectrogram diffusion
An ideal music synthesizer should be both interactive and expressive, generating high-
fidelity audio in realtime for arbitrary combinations of instruments and notes. Recent neural …
GiantMIDI-Piano: A large-scale MIDI dataset for classical piano music
Symbolic music datasets are important for music information retrieval and musical analysis.
However, there is a lack of large-scale symbolic datasets for classical piano music. In this …