NaturalSpeech 2: Latent diffusion models are natural and zero-shot speech and singing synthesizers

K Shen, Z Ju, X Tan, Y Liu, Y Leng, L He, T Qin… - arXiv preprint arXiv …, 2023 - arxiv.org
Scaling text-to-speech (TTS) to large-scale, multi-speaker, and in-the-wild datasets is
important to capture the diversity in human speech such as speaker identities, prosodies …

Hybrid transformers for music source separation

S Rouard, F Massa, A Défossez - ICASSP 2023-2023 IEEE …, 2023 - ieeexplore.ieee.org
A natural question arising in Music Source Separation (MSS) is whether long range
contextual information is useful, or whether local acoustic features are sufficient. In other …

Music source separation with band-split RNN

Y Luo, J Yu - IEEE/ACM Transactions on Audio, Speech, and …, 2023 - ieeexplore.ieee.org
The performance of music source separation (MSS) models has been greatly improved in
recent years thanks to the development of novel neural network architectures and training …

Music demixing challenge 2021

Y Mitsufuji, G Fabbro, S Uhlich, FR Stöter… - Frontiers in Signal …, 2022 - frontiersin.org
Music source separation has been intensively studied in the last decade and tremendous
progress with the advent of deep learning could be observed. Evaluation campaigns such …

Multi-source diffusion models for simultaneous music generation and separation

G Mariani, I Tallini, E Postolache, M Mancusi… - arXiv preprint arXiv …, 2023 - arxiv.org
In this work, we define a diffusion-based generative model capable of both music synthesis
and source separation by learning the score of the joint probability density of sources …

Waveform-domain speech enhancement using spectrogram encoding for robust speech recognition

H Shi, M Mimura, T Kawahara - IEEE/ACM Transactions on …, 2024 - ieeexplore.ieee.org
While waveform-domain speech enhancement (SE) has been extensively investigated in
recent years and achieves state-of-the-art performance in many datasets, spectrogram …

SongCreator: Lyrics-based universal song generation

S Lei, Y Zhou, B Tang, MWY Lam… - Advances in …, 2025 - proceedings.neurips.cc
Music is an integral part of human culture, embodying human intelligence and creativity, of
which songs compose an essential part. While various aspects of song generation have …

AERO: Audio super resolution in the spectral domain

M Mandel, O Tal, Y Adi - ICASSP 2023-2023 IEEE International …, 2023 - ieeexplore.ieee.org
We present AERO, an audio super-resolution model that processes speech and music
signals in the spectral domain. AERO is based on an encoder-decoder architecture with …

The Sound Demixing Challenge 2023 – Music Demixing Track

G Fabbro, S Uhlich, CH Lai, W Choi… - arXiv preprint arXiv …, 2023 - arxiv.org
This paper summarizes the music demixing (MDX) track of the Sound Demixing Challenge
(SDX'23). We provide a summary of the challenge setup and introduce the task of robust …

TorchAudio 2.1: Advancing speech recognition, self-supervised learning, and audio processing components for PyTorch

J Hwang, M Hira, C Chen, X Zhang, Z Ni… - 2023 IEEE Automatic …, 2023 - ieeexplore.ieee.org
TorchAudio is an open-source audio and speech processing library built for PyTorch. It aims
to accelerate the research and development of audio and speech technologies by providing …