A review of differentiable digital signal processing for music and speech synthesis
The term “differentiable digital signal processing” describes a family of techniques in which
loss function gradients are backpropagated through digital signal processors, facilitating …
The state of the art in procedural audio
Procedural audio may be defined as real-time sound generation according to programmatic
rules and live input. It is often considered a subset of sound synthesis and is especially …
Adapting Frechet audio distance for generative music evaluation
The growing popularity of generative music models underlines the need for perceptually
relevant, objective music quality metrics. The Frechet Audio Distance (FAD) is commonly …
Multi-modal latent diffusion
Multimodal datasets are ubiquitous in modern applications, and multimodal Variational
Autoencoders are a popular family of models that aim to learn a joint representation of …
Configurable EBEN: Extreme bandwidth extension network to enhance body-conducted speech capture
J Hauret, T Joubaud, V Zimpfer… - IEEE/ACM Transactions …, 2023 - ieeexplore.ieee.org
This article presents a configurable version of Extreme Bandwidth Extension Network
(EBEN), a Generative Adversarial Network (GAN) designed to improve audio captured with …
Siamese SIREN: Audio compression with implicit neural representations
Implicit Neural Representations (INRs) have emerged as a promising method for
representing diverse data modalities, including 3D shapes, images, and audio. While recent …
PAGURI: a user experience study of creative interaction with text-to-music models
In recent years, text-to-music models have been the biggest breakthrough in automatic
music generation. While they are unquestionably a showcase of technological progress, it is …
Latent space interpolation of synthesizer parameters using timbre-regularized auto-encoders
Sound synthesizers are ubiquitous in modern music production but manipulating their
presets, i.e., the sets of synthesis parameters, demands expert skills. This study presents a …
What you hear is what you see: Audio quality metrics from image quality metrics
In this study, we investigate the feasibility of utilizing state-of-the-art image perceptual
metrics for evaluating audio signals by representing them as spectrograms. The …
Conditional sound effects generation with regularized WGAN
Over recent years generative models utilizing deep neural networks have demonstrated
outstanding capacity in synthesizing high-quality and plausible human speech and music …