A review of deep learning techniques for speech processing

A Mehrish, N Majumder, R Bharadwaj, R Mihalcea… - Information …, 2023 - Elsevier
The field of speech processing has undergone a transformative shift with the advent of deep
learning. The use of multiple processing layers has enabled the creation of models capable …

Voicebox: Text-guided multilingual universal speech generation at scale

M Le, A Vyas, B Shi, B Karrer, L Sari… - Advances in neural …, 2024 - proceedings.neurips.cc
Large-scale generative models such as GPT and DALL-E have revolutionized the research
community. These models not only generate high fidelity outputs, but are also generalists …

Textually pretrained speech language models

M Hassid, T Remez, TA Nguyen, I Gat… - Advances in …, 2024 - proceedings.neurips.cc
Speech language models (SpeechLMs) process and generate acoustic data only, without
textual supervision. In this work, we propose TWIST, a method for training SpeechLMs using …

From discrete tokens to high-fidelity audio using multi-band diffusion

R San Roman, Y Adi, A Deleforge… - Advances in …, 2024 - proceedings.neurips.cc
Deep generative models can generate high-fidelity audio conditioned on various types of
representations (e.g., mel-spectrograms, Mel-frequency Cepstral Coefficients (MFCC)) …

AudioToken: Adaptation of text-conditioned diffusion models for audio-to-image generation

G Yariv, I Gat, L Wolf, Y Adi, I Schwartz - arXiv preprint arXiv:2305.13050, 2023 - arxiv.org
In recent years, image generation has shown a great leap in performance, where diffusion
models play a central role. Although generating high-quality images, such models are …

LAST: Language model aware speech tokenization

A Turetzky, Y Adi - arXiv preprint arXiv:2409.03701, 2024 - arxiv.org
Speech tokenization serves as the foundation of speech language models (LMs), enabling
them to perform various tasks such as spoken language modeling, text-to-speech, speech-to …

StethoSpeech: Speech Generation Through a Clinical Stethoscope Attached to the Skin

N Shah, N Sahipjohn, V Tambrahalli… - Proceedings of the …, 2024 - dl.acm.org
We introduce StethoSpeech, a silent speech interface that transforms flesh-conducted
vibrations behind the ear into speech. This innovation is designed to improve social …