A review of deep learning techniques for speech processing

A Mehrish, N Majumder, R Bharadwaj, R Mihalcea… - Information …, 2023 - Elsevier
The field of speech processing has undergone a transformative shift with the advent of deep
learning. The use of multiple processing layers has enabled the creation of models capable …

A comprehensive review of data‐driven co‐speech gesture generation

S Nyatsanga, T Kucherenko, C Ahuja… - Computer Graphics …, 2023 - Wiley Online Library
Gestures that accompany speech are an essential part of natural and efficient embodied
human communication. The automatic generation of such co‐speech gestures is a long …

Voicebox: Text-guided multilingual universal speech generation at scale

M Le, A Vyas, B Shi, B Karrer, L Sari… - Advances in neural …, 2023 - proceedings.neurips.cc
Large-scale generative models such as GPT and DALL-E have revolutionized the research
community. These models not only generate high fidelity outputs, but are also generalists …

Symphonize 3d semantic scene completion with contextual instance queries

H Jiang, T Cheng, N Gao, H Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Abstract 3D Semantic Scene Completion (SSC) has emerged as a nascent and pivotal
undertaking in autonomous driving aiming to predict the voxel occupancy within volumetric …

Noise2music: Text-conditioned music generation with diffusion models

Q Huang, DS Park, T Wang, TI Denk, A Ly… - arxiv preprint arxiv …, 2023 - arxiv.org
We introduce Noise2Music, where a series of diffusion models is trained to generate high-
quality 30-second music clips from text prompts. Two types of diffusion models, a generator …

Naturalspeech 2: Latent diffusion models are natural and zero-shot speech and singing synthesizers

K Shen, Z Ju, X Tan, Y Liu, Y Leng, L He, T Qin… - arxiv preprint arxiv …, 2023 - arxiv.org
Scaling text-to-speech (TTS) to large-scale, multi-speaker, and in-the-wild datasets is
important to capture the diversity in human speech such as speaker identities, prosodies …

NaturalSpeech: End-to-End Text-to-Speech Synthesis With Human-Level Quality

X Tan, J Chen, H Liu, J Cong, C Zhang… - … on Pattern Analysis …, 2024 - ieeexplore.ieee.org
Text-to-speech (TTS) has made rapid progress in both academia and industry in recent
years. Some questions naturally arise that whether a TTS system can achieve human-level …

A survey on neural speech synthesis

X Tan, T Qin, F Soong, TY Liu - arxiv preprint arxiv:2106.15561, 2021 - arxiv.org
Text to speech (TTS), or speech synthesis, which aims to synthesize intelligible and natural
speech given text, is a hot research topic in speech, language, and machine learning …

Nv-embed: Improved techniques for training llms as generalist embedding models

C Lee, R Roy, M Xu, J Raiman, M Shoeybi… - arxiv preprint arxiv …, 2024 - arxiv.org
Decoder-only large language model (LLM)-based embedding models are beginning to
outperform BERT or T5-based embedding models in general-purpose text embedding tasks …

Add 2022: the first audio deep synthesis detection challenge

J Yi, R Fu, J Tao, S Nie, H Ma, C Wang… - ICASSP 2022-2022 …, 2022 - ieeexplore.ieee.org
Audio deepfake detection is an emerging topic, which was included in the ASVspoof 2021.
However, the recent shared tasks have not covered many real-life and challenging …