A review of deep learning techniques for speech processing
The field of speech processing has undergone a transformative shift with the advent of deep
learning. The use of multiple processing layers has enabled the creation of models capable …
learning. The use of multiple processing layers has enabled the creation of models capable …
Spoken language interaction with robots: Recommendations for future research
With robotics rapidly advancing, more effective human–robot interaction is increasingly
needed to realize the full potential of robots for society. While spoken language must be part …
needed to realize the full potential of robots for society. While spoken language must be part …
Voicebox: Text-guided multilingual universal speech generation at scale
Large-scale generative models such as GPT and DALL-E have revolutionized the research
community. These models not only generate high fidelity outputs, but are also generalists …
community. These models not only generate high fidelity outputs, but are also generalists …
Naturalspeech 2: Latent diffusion models are natural and zero-shot speech and singing synthesizers
Scaling text-to-speech (TTS) to large-scale, multi-speaker, and in-the-wild datasets is
important to capture the diversity in human speech such as speaker identities, prosodies …
important to capture the diversity in human speech such as speaker identities, prosodies …
Noise2music: Text-conditioned music generation with diffusion models
We introduce Noise2Music, where a series of diffusion models is trained to generate high-
quality 30-second music clips from text prompts. Two types of diffusion models, a generator …
quality 30-second music clips from text prompts. Two types of diffusion models, a generator …
A survey on neural speech synthesis
Text to speech (TTS), or speech synthesis, which aims to synthesize intelligible and natural
speech given text, is a hot research topic in speech, language, and machine learning …
speech given text, is a hot research topic in speech, language, and machine learning …
Naturalspeech: End-to-end text-to-speech synthesis with human-level quality
Text-to-speech (TTS) has made rapid progress in both academia and industry in recent
years. Some questions naturally arise that whether a TTS system can achieve human-level …
years. Some questions naturally arise that whether a TTS system can achieve human-level …
[PDF][PDF] Jukebox: A generative model for music
We introduce Jukebox, a model that generates music with singing in the raw audio domain.
We tackle the long context of raw audio using a multiscale VQ-VAE to compress it to discrete …
We tackle the long context of raw audio using a multiscale VQ-VAE to compress it to discrete …
Glow-tts: A generative flow for text-to-speech via monotonic alignment search
Abstract Recently, text-to-speech (TTS) models such as FastSpeech and ParaNet have been
proposed to generate mel-spectrograms from text in parallel. Despite the advantage, the …
proposed to generate mel-spectrograms from text in parallel. Despite the advantage, the …
Add 2022: the first audio deep synthesis detection challenge
Audio deepfake detection is an emerging topic, which was included in the ASVspoof 2021.
However, the recent shared tasks have not covered many real-life and challenging …
However, the recent shared tasks have not covered many real-life and challenging …