A review of deep learning techniques for speech processing

A Mehrish, N Majumder, R Bharadwaj, R Mihalcea… - Information …, 2023 - Elsevier
The field of speech processing has undergone a transformative shift with the advent of deep
learning. The use of multiple processing layers has enabled the creation of models capable …

Spoken language interaction with robots: Recommendations for future research

M Marge, C Espy-Wilson, NG Ward, A Alwan… - Computer Speech & …, 2022 - Elsevier
With robotics rapidly advancing, more effective human–robot interaction is increasingly
needed to realize the full potential of robots for society. While spoken language must be part …

Voicebox: Text-guided multilingual universal speech generation at scale

M Le, A Vyas, B Shi, B Karrer, L Sari… - Advances in neural …, 2024 - proceedings.neurips.cc
Large-scale generative models such as GPT and DALL-E have revolutionized the research
community. These models not only generate high fidelity outputs, but are also generalists …

NaturalSpeech 2: Latent diffusion models are natural and zero-shot speech and singing synthesizers

K Shen, Z Ju, X Tan, Y Liu, Y Leng, L He, T Qin… - arXiv preprint arXiv …, 2023 - arxiv.org
Scaling text-to-speech (TTS) to large-scale, multi-speaker, and in-the-wild datasets is
important to capture the diversity in human speech such as speaker identities, prosodies …

Noise2Music: Text-conditioned music generation with diffusion models

Q Huang, DS Park, T Wang, TI Denk, A Ly… - arXiv preprint arXiv …, 2023 - arxiv.org
We introduce Noise2Music, where a series of diffusion models is trained to generate high-
quality 30-second music clips from text prompts. Two types of diffusion models, a generator …

A survey on neural speech synthesis

X Tan, T Qin, F Soong, TY Liu - arXiv preprint arXiv:2106.15561, 2021 - arxiv.org
Text to speech (TTS), or speech synthesis, which aims to synthesize intelligible and natural
speech given text, is a hot research topic in speech, language, and machine learning …

NaturalSpeech: End-to-end text-to-speech synthesis with human-level quality

X Tan, J Chen, H Liu, J Cong, C Zhang… - … on Pattern Analysis …, 2024 - ieeexplore.ieee.org
Text-to-speech (TTS) has made rapid progress in both academia and industry in recent
years. Some questions naturally arise that whether a TTS system can achieve human-level …

Jukebox: A generative model for music

P Dhariwal, H Jun, C Payne, JW Kim… - arXiv preprint arXiv …, 2020 - assets.pubpub.org
We introduce Jukebox, a model that generates music with singing in the raw audio domain.
We tackle the long context of raw audio using a multiscale VQ-VAE to compress it to discrete …

Glow-TTS: A generative flow for text-to-speech via monotonic alignment search

J Kim, S Kim, J Kong, S Yoon - Advances in Neural …, 2020 - proceedings.neurips.cc
Recently, text-to-speech (TTS) models such as FastSpeech and ParaNet have been
proposed to generate mel-spectrograms from text in parallel. Despite the advantage, the …

ADD 2022: the first audio deep synthesis detection challenge

J Yi, R Fu, J Tao, S Nie, H Ma, C Wang… - ICASSP 2022-2022 …, 2022 - ieeexplore.ieee.org
Audio deepfake detection is an emerging topic, which was included in the ASVspoof 2021.
However, the recent shared tasks have not covered many real-life and challenging …