A review of deep learning techniques for speech processing

A Mehrish, N Majumder, R Bharadwaj, R Mihalcea… - Information …, 2023 - Elsevier
The field of speech processing has undergone a transformative shift with the advent of deep
learning. The use of multiple processing layers has enabled the creation of models capable …

Measuring disentanglement: A review of metrics

MA Carbonneau, J Zaidi, J Boilard… - IEEE transactions on …, 2022 - ieeexplore.ieee.org
Learning to disentangle and represent factors of variation in data is an important problem in
artificial intelligence. While many advances have been made to learn these representations …

Voicebox: Text-guided multilingual universal speech generation at scale

M Le, A Vyas, B Shi, B Karrer, L Sari… - Advances in neural …, 2024 - proceedings.neurips.cc
Large-scale generative models such as GPT and DALL-E have revolutionized the research
community. These models not only generate high fidelity outputs, but are also generalists …

LibriTTS: A corpus derived from LibriSpeech for text-to-speech

H Zen, V Dang, R Clark, Y Zhang, RJ Weiss… - arXiv preprint arXiv …, 2019 - arxiv.org
This paper introduces a new speech corpus called "LibriTTS" designed for text-to-speech
use. It is derived from the original audio and text materials of the LibriSpeech corpus, which …

Unsupervised speech representation learning using WaveNet autoencoders

J Chorowski, RJ Weiss, S Bengio… - … /ACM transactions on …, 2019 - ieeexplore.ieee.org
We consider the task of unsupervised extraction of meaningful latent representations of
speech by applying autoencoding neural networks to speech waveforms. The goal is to …

Meta-StyleSpeech: Multi-speaker adaptive text-to-speech generation

D Min, DB Lee, E Yang… - … Conference on Machine …, 2021 - proceedings.mlr.press
With rapid progress in neural text-to-speech (TTS) models, personalized speech generation
is now in high demand for many applications. For practical applicability, a TTS model should …

ESPnet-TTS: Unified, reproducible, and integratable open source end-to-end text-to-speech toolkit

T Hayashi, R Yamamoto, K Inoue… - ICASSP 2020-2020 …, 2020 - ieeexplore.ieee.org
This paper introduces a new end-to-end text-to-speech (E2E-TTS) toolkit named ESPnet-
TTS, which is an extension of the open-source speech processing toolkit ESPnet. The toolkit …

Audiobox: Unified audio generation with natural language prompts

A Vyas, B Shi, M Le, A Tjandra, YC Wu, B Guo… - arXiv preprint arXiv …, 2023 - arxiv.org
Audio is an essential part of our life, but creating it often requires expertise and is time-
consuming. Research communities have made great progress over the past year advancing …

LipSync3D: Data-efficient learning of personalized 3D talking faces from video using pose and lighting normalization

A Lahiri, V Kwatra, C Frueh, J Lewis… - Proceedings of the …, 2021 - openaccess.thecvf.com
In this paper, we present a video-based learning framework for animating personalized 3D
talking faces from audio. We introduce two training-time data normalizations that significantly …

ZeroEGGS: Zero‐shot Example‐based Gesture Generation from Speech

S Ghorbani, Y Ferstl, D Holden, NF Troje… - Computer Graphics …, 2023 - Wiley Online Library
We present ZeroEGGS, a neural network framework for speech‐driven gesture generation
with zero‐shot style control by example. This means style can be controlled via only a short …