A review of deep learning techniques for speech processing

A Mehrish, N Majumder, R Bharadwaj, R Mihalcea… - Information …, 2023 - Elsevier
The field of speech processing has undergone a transformative shift with the advent of deep
learning. The use of multiple processing layers has enabled the creation of models capable …

On the challenges and opportunities in generative AI

L Manduchi, K Pandey, R Bamler, R Cotterell… - arXiv preprint arXiv …, 2024 - arxiv.org
The field of deep generative modeling has grown rapidly and consistently over the years.
With the availability of massive amounts of training data coupled with advances in scalable …

AudioLDM: Text-to-audio generation with latent diffusion models

H Liu, Z Chen, Y Yuan, X Mei, X Liu, D Mandic… - arXiv preprint arXiv …, 2023 - arxiv.org
Text-to-audio (TTA) system has recently gained attention for its ability to synthesize general
audio based on text descriptions. However, previous studies in TTA have limited generation …

AudioLDM 2: Learning holistic audio generation with self-supervised pretraining

H Liu, Y Yuan, X Liu, X Mei, Q Kong… - … on Audio, Speech …, 2024 - ieeexplore.ieee.org
Although audio generation shares commonalities across different types of audio, such as
speech, music, and sound effects, designing models for each type requires careful …

Diffsound: Discrete diffusion model for text-to-sound generation

D Yang, J Yu, H Wang, W Wang… - … on Audio, Speech …, 2023 - ieeexplore.ieee.org
Generating sound effects that people want is an important topic. However, there are limited
studies in this area for sound generation. In this study, we investigate generating sound …

StyleTTS 2: Towards human-level text-to-speech through style diffusion and adversarial training with large speech language models

YA Li, C Han, V Raghavan… - Advances in Neural …, 2023 - proceedings.neurips.cc
In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that leverages style
diffusion and adversarial training with large speech language models (SLMs) to achieve …

BigVGAN: A universal neural vocoder with large-scale training

S Lee, W Ping, B Ginsburg, B Catanzaro… - arXiv preprint arXiv …, 2022 - arxiv.org
Despite recent progress in generative adversarial network (GAN)-based vocoders, where
the model generates raw waveform conditioned on acoustic features, it is challenging to …

Deblurring via stochastic refinement

J Whang, M Delbracio, H Talebi… - Proceedings of the …, 2022 - openaccess.thecvf.com
Image deblurring is an ill-posed problem with multiple plausible solutions for a given input
image. However, most existing methods produce a deterministic estimate of the clean image …

A survey on neural speech synthesis

X Tan, T Qin, F Soong, TY Liu - arXiv preprint arXiv:2106.15561, 2021 - arxiv.org
Text to speech (TTS), or speech synthesis, which aims to synthesize intelligible and natural
speech given text, is a hot research topic in speech, language, and machine learning …

Lossy image compression with conditional diffusion models

R Yang, S Mandt - Advances in Neural Information …, 2023 - proceedings.neurips.cc
This paper outlines an end-to-end optimized lossy image compression framework using
diffusion generative models. The approach relies on the transform coding paradigm, where …