A review of deep learning techniques for speech processing
The field of speech processing has undergone a transformative shift with the advent of deep
learning. The use of multiple processing layers has enabled the creation of models capable …
Measuring disentanglement: A review of metrics
Learning to disentangle and represent factors of variation in data is an important problem in
artificial intelligence. While many advances have been made to learn these representations …
Voicebox: Text-guided multilingual universal speech generation at scale
Large-scale generative models such as GPT and DALL-E have revolutionized the research
community. These models not only generate high fidelity outputs, but are also generalists …
LibriTTS: A corpus derived from LibriSpeech for text-to-speech
This paper introduces a new speech corpus called "LibriTTS" designed for text-to-speech
use. It is derived from the original audio and text materials of the LibriSpeech corpus, which …
Unsupervised speech representation learning using WaveNet autoencoders
We consider the task of unsupervised extraction of meaningful latent representations of
speech by applying autoencoding neural networks to speech waveforms. The goal is to …
Meta-StyleSpeech: Multi-speaker adaptive text-to-speech generation
With rapid progress in neural text-to-speech (TTS) models, personalized speech generation
is now in high demand for many applications. For practical applicability, a TTS model should …
ESPnet-TTS: Unified, reproducible, and integratable open source end-to-end text-to-speech toolkit
This paper introduces a new end-to-end text-to-speech (E2E-TTS) toolkit named ESPnet-
TTS, which is an extension of the open-source speech processing toolkit ESPnet. The toolkit …
Audiobox: Unified audio generation with natural language prompts
Audio is an essential part of our life, but creating it often requires expertise and is time-
consuming. Research communities have made great progress over the past year advancing …
LipSync3D: Data-efficient learning of personalized 3D talking faces from video using pose and lighting normalization
In this paper, we present a video-based learning framework for animating personalized 3D
talking faces from audio. We introduce two training-time data normalizations that significantly …
ZeroEGGS: Zero‐shot Example‐based Gesture Generation from Speech
We present ZeroEGGS, a neural network framework for speech‐driven gesture generation
with zero‐shot style control by example. This means style can be controlled via only a short …