Audiobox: Unified audio generation with natural language prompts

A Vyas, B Shi, M Le, A Tjandra, YC Wu, B Guo… - arXiv preprint arXiv …, 2023 - arxiv.org
Audio is an essential part of our life, but creating it often requires expertise and is
time-consuming. Research communities have made great progress over the past year advancing …

Natural language guidance of high-fidelity text-to-speech with synthetic annotations

D Lyth, S King - arXiv preprint arXiv:2402.01912, 2024 - arxiv.org
Text-to-speech models trained on large-scale datasets have demonstrated impressive
in-context learning capabilities and naturalness. However, control of speaker identity and style …

TorchAudio 2.1: Advancing speech recognition, self-supervised learning, and audio processing components for PyTorch

J Hwang, M Hira, C Chen, X Zhang, Z Ni… - 2023 IEEE Automatic …, 2023 - ieeexplore.ieee.org
TorchAudio is an open-source audio and speech processing library built for PyTorch. It aims
to accelerate the research and development of audio and speech technologies by providing …

Spectral codecs: Spectrogram-based audio codecs for high quality speech synthesis

R Langman, A Jukić, K Dhawan, NR Koluguri… - arXiv preprint arXiv …, 2024 - arxiv.org
Historically, most speech models in machine-learning have used the mel-spectrogram as a
speech representation. Recently, discrete audio tokens produced by neural audio codecs …

Self-supervised speech quality estimation and enhancement using only clean speech

SW Fu, KH Hung, Y Tsao, YCF Wang - arXiv preprint arXiv:2402.16321, 2024 - arxiv.org
Speech quality estimation has recently undergone a paradigm shift from human-hearing
expert designs to machine-learning models. However, current models rely mainly on …

SpeechPrompt: Prompting speech language models for speech processing tasks

KW Chang, H Wu, YK Wang, YK Wu… - … on Audio, Speech …, 2024 - ieeexplore.ieee.org
Prompting has become a practical method for utilizing pre-trained language models (LMs).
This approach offers several advantages. It allows an LM to adapt to new tasks with minimal …

Low frame-rate speech codec: a codec designed for fast high-quality speech LLM training and inference

E Casanova, R Langman, P Neekhara… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) have significantly advanced audio processing through
audio codecs that convert audio into discrete tokens, enabling the application of language …

Generative speech foundation model pretraining for high-quality speech extraction and restoration

PJ Ku, AH Liu, R Korostik, SF Huang, SW Fu… - arXiv preprint arXiv …, 2024 - arxiv.org
This paper proposes a generative pretraining foundation model for high-quality speech
restoration tasks. By directly operating on complex-valued short-time Fourier transform …

VERSA: A Versatile Evaluation Toolkit for Speech, Audio, and Music

J Shi, H Shim, J Tian, S Arora, H Wu… - arXiv preprint arXiv …, 2024 - arxiv.org
In this work, we introduce VERSA, a unified and standardized evaluation toolkit designed for
various speech, audio, and music signals. The toolkit features a Pythonic interface with …

AV2Wav: Diffusion-based re-synthesis from continuous self-supervised features for audio-visual speech enhancement

JC Chou, CM Chien, K Livescu - ICASSP 2024-2024 IEEE …, 2024 - ieeexplore.ieee.org
Speech enhancement systems are typically trained using pairs of clean and noisy speech. In
audio-visual speech enhancement (AVSE), there is not as much ground-truth clean data …