Robust speech recognition via large-scale weak supervision

A Radford, JW Kim, T Xu, G Brockman… - International …, 2023 - proceedings.mlr.press
We study the capabilities of speech processing systems trained simply to predict large
amounts of transcripts of audio on the internet. When scaled to 680,000 hours of multilingual …

ESPnet-SLU: Advancing spoken language understanding through ESPnet

S Arora, S Dalmia, P Denisov, X Chang… - ICASSP 2022-2022 …, 2022 - ieeexplore.ieee.org
As Automatic Speech Recognition (ASR) systems are getting better, there is an increasing
interest in using the ASR output to do downstream Natural Language Processing (NLP) …

Fast Conformer with linearly scalable attention for efficient speech recognition

D Rekesh, NR Koluguri, S Kriman… - 2023 IEEE Automatic …, 2023 - ieeexplore.ieee.org
Conformer-based models have become the dominant end-to-end architecture for speech
processing tasks. With the objective of enhancing the conformer architecture for efficient …

AudioBench: A universal benchmark for audio large language models

B Wang, X Zou, G Lin, S Sun, Z Liu, W Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce AudioBench, a universal benchmark designed to evaluate Audio Large
Language Models (AudioLLMs). It encompasses 8 distinct tasks and 26 datasets, among …

Less is more: Accurate speech recognition & translation without web-scale data

KC Puvvada, P Żelasko, H Huang, O Hrinchuk… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent advances in speech recognition and translation rely on hundreds of thousands of
hours of Internet speech data. We argue that state-of-the-art accuracy can be reached …

A study on the integration of pre-trained SSL, ASR, LM and SLU models for spoken language understanding

Y Peng, S Arora, Y Higuchi, Y Ueda… - 2022 IEEE Spoken …, 2023 - ieeexplore.ieee.org
Collecting sufficient labeled data for spoken language understanding (SLU) is expensive
and time-consuming. Recent studies achieved promising results by using pre-trained …

VarArray: Array-geometry-agnostic continuous speech separation

T Yoshioka, X Wang, D Wang, M Tang… - ICASSP 2022-2022 …, 2022 - ieeexplore.ieee.org
Continuous speech separation using a microphone array was shown to be promising in
dealing with the speech overlap problem in natural conversation transcription. This paper …

Token-level sequence labeling for spoken language understanding using compositional end-to-end models

S Arora, S Dalmia, B Yan, F Metze, AW Black… - arXiv preprint arXiv …, 2022 - arxiv.org
End-to-end spoken language understanding (SLU) systems are gaining popularity over
cascaded approaches due to their simplicity and ability to avoid error propagation. However …

Residual language model for end-to-end speech recognition

E Tsunoo, Y Kashiwagi, C Narisetty… - arXiv preprint arXiv …, 2022 - arxiv.org
End-to-end automatic speech recognition suffers from adaptation to unknown target domain
speech despite being trained with a large amount of paired audio-text data. Recent studies …

Improving contextual recognition of rare words with an alternate spelling prediction model

JD Fox, N Delworth - arXiv preprint arXiv:2209.01250, 2022 - arxiv.org
Contextual ASR, which takes a list of bias terms as input along with audio, has drawn recent
interest as ASR use becomes more widespread. We are releasing contextual biasing lists to …