Recent advances in end-to-end automatic speech recognition

J Li - APSIPA Transactions on Signal and Information Processing, 2022 - nowpublishers.com
Recently, the speech community has been seeing a significant trend of moving from deep-
neural-network-based hybrid modeling to end-to-end (E2E) modeling for automatic speech …

Self-supervised representation learning: Introduction, advances, and challenges

L Ericsson, H Gouk, CC Loy… - IEEE Signal Processing Magazine, 2022 - ieeexplore.ieee.org
Self-supervised representation learning (SSRL) methods aim to provide powerful, deep
feature learning without the requirement of large annotated data sets, thus alleviating the …

Scaling speech technology to 1,000+ languages

V Pratap, A Tjandra, B Shi, P Tomasello, A Babu… - Journal of Machine Learning Research, 2024 - jmlr.org
Expanding the language coverage of speech technology has the potential to improve
access to information for many more people. However, current speech technology is …

End-to-end speech recognition: A survey

R Prabhavalkar, T Hori, TN Sainath… - IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023 - ieeexplore.ieee.org
In the last decade of automatic speech recognition (ASR) research, the introduction of deep
learning has brought considerable reductions in word error rate of more than 50% relative …

Modular deep learning

J Pfeiffer, S Ruder, I Vulić, EM Ponti - arXiv preprint arXiv:2302.11529, 2023 - arxiv.org
Transfer learning has recently become the dominant paradigm of machine learning. Pre-
trained models fine-tuned for downstream tasks achieve better performance with fewer …

Unsupervised cross-lingual representation learning for speech recognition

A Conneau, A Baevski, R Collobert… - arXiv preprint arXiv…, 2020 - arxiv.org
This paper presents XLSR, which learns cross-lingual speech representations by pretraining
a single model from the raw waveform of speech in multiple languages. We build on …
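
As a hedged sketch of what such cross-lingual pretraining enables downstream: the snippet below fine-tunes a publicly released XLSR checkpoint with a fresh CTC head for a new language, assuming the Hugging Face `transformers` port of wav2vec 2.0/XLSR. The checkpoint name `facebook/wav2vec2-large-xlsr-53`, the toy character vocabulary, and the random waveform are illustrative stand-ins, not the paper's recipe.

```python
import json

import numpy as np
import torch
from transformers import (Wav2Vec2CTCTokenizer, Wav2Vec2FeatureExtractor,
                          Wav2Vec2ForCTC)

# Toy character vocabulary for the target language ("|" marks word boundaries).
chars = list("abcdefghijklmnopqrstuvwxyz") + ["|"]
vocab = {c: i for i, c in enumerate(chars)}
vocab["[UNK]"] = len(vocab)
vocab["[PAD]"] = len(vocab)
with open("vocab.json", "w") as f:
    json.dump(vocab, f)

tokenizer = Wav2Vec2CTCTokenizer("vocab.json", unk_token="[UNK]",
                                 pad_token="[PAD]", word_delimiter_token="|")
extractor = Wav2Vec2FeatureExtractor(feature_size=1, sampling_rate=16000,
                                     padding_value=0.0,
                                     return_attention_mask=True)

# Multilingual pretrained encoder; the CTC head is new and randomly
# initialised for the toy vocabulary of the target language.
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53",
    vocab_size=len(vocab),
    pad_token_id=tokenizer.pad_token_id,
    ctc_loss_reduction="mean",
)
model.freeze_feature_encoder()  # common practice when fine-tuning

# One illustrative training step on a random 1 s, 16 kHz waveform.
waveform = np.random.randn(16000).astype(np.float32)
inputs = extractor(waveform, sampling_rate=16000, return_tensors="pt")
labels = torch.tensor([tokenizer("hello world").input_ids])
loss = model(inputs.input_values, labels=labels).loss
loss.backward()
```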

Exploring wav2vec 2.0 on speaker verification and language identification

Z Fan, M Li, S Zhou, B Xu - arXiv preprint arXiv:2012.06185, 2020 - arxiv.org
Wav2vec 2.0 is a recently proposed self-supervised framework for speech representation
learning. It follows a two-stage training process of pre-training and fine-tuning, and performs …
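
The two-stage recipe the snippet describes can be sketched as follows, assuming the Hugging Face `transformers` port of wav2vec 2.0: a pretrained encoder yields frame-level features, which are pooled into an utterance embedding for a downstream speaker-verification or language-ID classifier. The checkpoint name, dummy waveform, and mean pooling are illustrative assumptions, not the authors' exact setup.

```python
import numpy as np
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Load the pretrained encoder and its matching feature extractor.
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()

# Dummy 1-second, 16 kHz mono waveform standing in for a real utterance.
waveform = np.random.randn(16000).astype(np.float32)
inputs = extractor(waveform, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    # Stage-2 usage: the frozen encoder yields frame-level features.
    frames = model(inputs.input_values).last_hidden_state  # (1, T, 768)

# Mean-pool frames into one utterance-level embedding, e.g. as input
# to a speaker-verification or language-ID classifier head.
embedding = frames.mean(dim=1)  # shape (1, 768)
```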

mSLAM: Massively multilingual joint pre-training for speech and text

A Bapna, C Cherry, Y Zhang, Y Jia, M Johnson… - arXiv preprint arXiv…, 2022 - arxiv.org
We present mSLAM, a multilingual Speech and LAnguage Model that learns cross-lingual
cross-modal representations of speech and text by pre-training jointly on large amounts of …

Improving continuous sign language recognition with cross-lingual signs

F Wei, Y Chen - Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023 - openaccess.thecvf.com
This work is dedicated to continuous sign language recognition (CSLR), a weakly
supervised task dealing with the recognition of continuous signs from videos, without any …

Efficient adapter transfer of self-supervised speech models for automatic speech recognition

B Thomas, S Kessler, S Karout - IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022 - ieeexplore.ieee.org
Self-supervised learning (SSL) is a powerful tool that allows learning of underlying
representations from unlabeled data. Transformer-based models such as wav2vec 2.0 and …
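
A minimal sketch of the bottleneck-adapter pattern such work studies: a small residual MLP inserted into a frozen pretrained encoder so that only a few parameters are trained per task. The dimensions, GELU activation, and zero initialisation below are common conventions, assumed here rather than taken from the paper.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Residual bottleneck adapter: project down, nonlinearity, project up."""
    def __init__(self, d_model: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, d_model)
        # Zero-init the up-projection so the adapter starts as an identity
        # map and training begins from the frozen model's behaviour.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        return hidden + self.up(self.act(self.down(hidden)))

# The pretrained encoder would be frozen; only adapters (and a task head)
# receive gradients during fine-tuning.
adapter = Adapter()
hidden_states = torch.randn(2, 50, 768)  # (batch, frames, hidden dim)
assert adapter(hidden_states).shape == hidden_states.shape
```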