A review of differentiable digital signal processing for music and speech synthesis

B Hayes, J Shier, G Fazekas, A McPherson… - Frontiers in Signal …, 2024 - frontiersin.org
The term “differentiable digital signal processing” describes a family of techniques in which
loss function gradients are backpropagated through digital signal processors, facilitating …

The state of the art in procedural audio

D Menexopoulos, P Pestana, J Reiss - Journal of the Audio Engineering …, 2023 - aes.org
Procedural audio may be defined as real-time sound generation according to programmatic
rules and live input. It is often considered a subset of sound synthesis and is especially …

Adapting Frechet audio distance for generative music evaluation

A Gui, H Gamper, S Braun… - ICASSP 2024-2024 …, 2024 - ieeexplore.ieee.org
The growing popularity of generative music models underlines the need for perceptually
relevant, objective music quality metrics. The Frechet Audio Distance (FAD) is commonly …

Multi-modal latent diffusion

M Bounoua, G Franzese, P Michiardi - Entropy, 2024 - mdpi.com
Multimodal datasets are ubiquitous in modern applications, and multimodal Variational
Autoencoders are a popular family of models that aim to learn a joint representation of …

Configurable EBEN: Extreme bandwidth extension network to enhance body-conducted speech capture

J Hauret, T Joubaud, V Zimpfer… - IEEE/ACM Transactions …, 2023 - ieeexplore.ieee.org
This article presents a configurable version of Extreme Bandwidth Extension Network
(EBEN), a Generative Adversarial Network (GAN) designed to improve audio captured with …

Siamese SIREN: Audio compression with implicit neural representations

LA Lanzendörfer, R Wattenhofer - arXiv preprint arXiv:2306.12957, 2023 - arxiv.org
Implicit Neural Representations (INRs) have emerged as a promising method for
representing diverse data modalities, including 3D shapes, images, and audio. While recent …

PAGURI: a user experience study of creative interaction with text-to-music models

F Ronchini, L Comanducci, G Perego… - arXiv preprint arXiv …, 2024 - arxiv.org
In recent years, text-to-music models have been the biggest breakthrough in automatic
music generation. While they are unquestionably a showcase of technological progress, it is …

Latent space interpolation of synthesizer parameters using timbre-regularized auto-encoders

G Le Vaillant, T Dutoit - IEEE/ACM Transactions on Audio …, 2024 - ieeexplore.ieee.org
Sound synthesizers are ubiquitous in modern music production but manipulating their
presets, i.e., the sets of synthesis parameters, demands expert skills. This study presents a …

What you hear is what you see: Audio quality metrics from image quality metrics

T Namgyal, A Hepburn, R Santos-Rodriguez… - arXiv preprint arXiv …, 2023 - arxiv.org
In this study, we investigate the feasibility of utilizing state-of-the-art image perceptual
metrics for evaluating audio signals by representing them as spectrograms. The …

Conditional sound effects generation with regularized WGAN

Y Liu, C Jin - Proceedings of the Sound and Music Computing …, 2023 - researchgate.net
Over recent years generative models utilizing deep neural networks have demonstrated
outstanding capacity in synthesizing high-quality and plausible human speech and music …