Recent advances in end-to-end automatic speech recognition

J Li - APSIPA Transactions on Signal and Information …, 2022 - nowpublishers.com
Recently, the speech community is seeing a significant trend of moving from deep neural
network based hybrid modeling to end-to-end (E2E) modeling for automatic speech …

XLS-R: Self-supervised cross-lingual speech representation learning at scale

A Babu, C Wang, A Tjandra, K Lakhotia, Q Xu… - arXiv preprint arXiv …, 2021 - arxiv.org
This paper presents XLS-R, a large-scale model for cross-lingual speech representation
learning based on wav2vec 2.0. We train models with up to 2B parameters on nearly half a …

Unsupervised cross-lingual representation learning for speech recognition

A Conneau, A Baevski, R Collobert… - arXiv preprint arXiv …, 2020 - arxiv.org
This paper presents XLSR which learns cross-lingual speech representations by pretraining
a single model from the raw waveform of speech in multiple languages. We build on …

Applying wav2vec2.0 to speech recognition in various low-resource languages

C Yi, J Wang, N Cheng, S Zhou, B Xu - arXiv preprint arXiv:2012.12121, 2020 - arxiv.org
There are several domains that own corresponding widely used feature extractors, such as
ResNet, BERT, and GPT-x. These models are usually pre-trained on large amounts of …

Automatic speech recognition for Uyghur, Kazakh, and Kyrgyz: An overview

W Du, Y Maimaitiyiming, M Nijat, L Li, A Hamdulla… - Applied Sciences, 2022 - mdpi.com
With the emergence of deep learning, the performance of automatic speech recognition
(ASR) systems has remarkably improved. Especially for resource-rich languages such as …

Multilingual end-to-end speech translation

H Inaguma, K Duh, T Kawahara… - 2019 IEEE Automatic …, 2019 - ieeexplore.ieee.org
In this paper, we propose a simple yet effective framework for multilingual end-to-end
speech translation (ST), in which speech utterances in source languages are directly …

Leveraging modality-specific representations for audio-visual speech recognition via reinforcement learning

C Chen, Y Hu, Q Zhang, H Zou, B Zhu… - Proceedings of the AAAI …, 2023 - ojs.aaai.org
Audio-visual speech recognition (AVSR) has gained remarkable success for ameliorating
the noise-robustness of speech recognition. Mainstream methods focus on fusing audio and …

Massively multilingual adversarial speech recognition

O Adams, M Wiesner, S Watanabe… - arXiv preprint arXiv …, 2019 - arxiv.org
We report on adaptation of multilingual end-to-end speech recognition models trained on as
many as 100 languages. Our findings shed light on the relative importance of similarity …

Hierarchical transfer learning for multilingual, multi-speaker, and style transfer DNN-based TTS on low-resource languages

K Azizah, M Adriani, W Jatmiko - IEEE Access, 2020 - ieeexplore.ieee.org
This work applies a hierarchical transfer learning to implement deep neural network
(DNN)-based multilingual text-to-speech (TTS) for low-resource languages. DNN-based system …

XTREME-S: Evaluating cross-lingual speech representations

A Conneau, A Bapna, Y Zhang, M Ma… - arXiv preprint arXiv …, 2022 - arxiv.org
We introduce XTREME-S, a new benchmark to evaluate universal cross-lingual speech
representations in many languages. XTREME-S covers four task families: speech …