Scaling speech technology to 1,000+ languages

V Pratap, A Tjandra, B Shi, P Tomasello, A Babu… - Journal of Machine …, 2024 - jmlr.org
Expanding the language coverage of speech technology has the potential to improve
access to information for many more people. However, current speech technology is …

TorchAudio 2.1: Advancing speech recognition, self-supervised learning, and audio processing components for PyTorch

J Hwang, M Hira, C Chen, X Zhang, Z Ni… - 2023 IEEE Automatic …, 2023 - ieeexplore.ieee.org
TorchAudio is an open-source audio and speech processing library built for PyTorch. It aims
to accelerate the research and development of audio and speech technologies by providing …

Pseudo-labeling for massively multilingual speech recognition

L Lugosch, T Likhomanenko… - ICASSP 2022-2022 …, 2022 - ieeexplore.ieee.org
Semi-supervised learning through pseudo-labeling has become a staple of state-of-the-art
monolingual speech recognition systems. In this work, we extend pseudo-labeling to …

Exploration on HuBERT with multiple resolutions

J Shi, Y Tang, H Inaguma, H Gong, J Pino… - arXiv preprint arXiv …, 2023 - arxiv.org
Hidden-unit BERT (HuBERT) is a widely-used self-supervised learning (SSL) model in
speech processing. However, we argue that its fixed 20ms resolution for hidden …

Scaling a simple approach to zero-shot speech recognition

J Zhao, V Pratap, M Auli - arXiv preprint arXiv:2407.17852, 2024 - arxiv.org
Despite rapid progress in increasing the language coverage of automatic speech
recognition, the field is still far from covering all languages with a known writing script …

Leveraging supplementary text data to kick-start automatic speech recognition system development with limited transcriptions

N San, M Bartelds, B Billings, E de Falco… - arXiv preprint arXiv …, 2023 - arxiv.org
Recent research using pre-trained transformer models suggests that just 10 minutes of
transcribed speech may be enough to fine-tune such a model for automatic speech …

On the Problem of Text-To-Speech Model Selection for Synthetic Data Generation in Automatic Speech Recognition

N Rossenbach, R Schlüter, S Sakti - arXiv preprint arXiv:2407.21476, 2024 - arxiv.org
The rapid development of neural text-to-speech (TTS) systems enabled its usage in other
areas of natural language processing such as automatic speech recognition (ASR) or …

AV-CPL: Continuous pseudo-labeling for audio-visual speech recognition

A Rouditchenko, R Collobert… - arXiv preprint arXiv …, 2023 - arxiv.org
Audio-visual speech contains synchronized audio and visual information that provides cross-
modal supervision to learn representations for both automatic speech recognition (ASR) and …

EURO: ESPnet unsupervised ASR open-source toolkit

D Gao, J Shi, SP Chuang, LP Garcia… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org
This paper describes the ESPnet Unsupervised ASR Open-source Toolkit (EURO), an end-
to-end open-source toolkit for unsupervised automatic speech recognition (UASR). EURO …

GPU-Accelerated WFST Beam Search Decoder for CTC-Based Speech Recognition

D Galvez, T Kaldewey - 2023 IEEE Automatic Speech …, 2023 - ieeexplore.ieee.org
While Connectionist Temporal Classification (CTC) models deliver state-of-the-art accuracy
in automated speech recognition (ASR) pipelines, their performance has been limited by …