Recent advances in end-to-end automatic speech recognition

J Li - APSIPA Transactions on Signal and Information …, 2022 - nowpublishers.com
Recently, the speech community is seeing a significant trend of moving from deep neural
network based hybrid modeling to end-to-end (E2E) modeling for automatic speech …

BigSSL: Exploring the frontier of large-scale semi-supervised learning for automatic speech recognition

Y Zhang, DS Park, W Han, J Qin… - IEEE Journal of …, 2022 - ieeexplore.ieee.org
We summarize the results of a host of efforts using giant automatic speech recognition (ASR)
models pre-trained using large, diverse unlabeled datasets containing approximately a …

End-to-end speech recognition: A survey

R Prabhavalkar, T Hori, TN Sainath… - … on Audio, Speech …, 2023 - ieeexplore.ieee.org
In the last decade of automatic speech recognition (ASR) research, the introduction of deep
learning has brought considerable reductions in word error rate of more than 50% relative …

A better and faster end-to-end model for streaming ASR

B Li, A Gulati, J Yu, TN Sainath, CC Chiu… - ICASSP 2021-2021 …, 2021 - ieeexplore.ieee.org
End-to-end (E2E) models have shown to outperform state-of-the-art conventional models for
streaming speech recognition [1] across many dimensions, including quality (as measured …

Dual-mode ASR: Unify and improve streaming ASR with full-context modeling

J Yu, W Han, A Gulati, CC Chiu, B Li… - arXiv preprint arXiv …, 2020 - arxiv.org
Streaming automatic speech recognition (ASR) aims to emit each hypothesized word as
quickly and accurately as possible, while full-context ASR waits for the completion of a full …

JOIST: A joint speech and text streaming model for ASR

TN Sainath, R Prabhavalkar, A Bapna… - 2022 IEEE Spoken …, 2023 - ieeexplore.ieee.org
We present JOIST, an algorithm to train a streaming, cascaded, encoder end-to-end (E2E)
model with both speech-text paired inputs, and text-only unpaired inputs. Unlike previous …

How does pre-trained wav2vec 2.0 perform on domain-shifted ASR? An extensive benchmark on air traffic control communications

J Zuluaga-Gomez, A Prasad… - 2022 IEEE Spoken …, 2023 - ieeexplore.ieee.org
Recent work on self-supervised pre-training focus on leveraging large-scale unlabeled
speech data to build robust end-to-end (E2E) acoustic models (AM) that can be later fine …

Streaming end-to-end multilingual speech recognition with joint language identification

C Zhang, B Li, T Sainath, T Strohman… - arXiv preprint arXiv …, 2022 - arxiv.org
Language identification is critical for many downstream tasks in automatic speech
recognition (ASR), and is beneficial to integrate into multilingual end-to-end ASR as an …

An Efficient Streaming Non-Recurrent On-Device End-to-End Model with Improvements to Rare-Word Modeling

TN Sainath, Y He, A Narayanan, R Botros, R Pang… - Interspeech, 2021 - researchgate.net
On-device end-to-end (E2E) models have shown improvements over a conventional model
on Search test sets in both quality, as measured by Word Error Rate (WER)[1], and latency …

Pseudo label is better than human label

D Hwang, KC Sim, Z Huo, T Strohman - arXiv preprint arXiv:2203.12668, 2022 - arxiv.org
State-of-the-art automatic speech recognition (ASR) systems are trained with tens of
thousands of hours of labeled speech data. Human transcription is expensive and time …