Robust speech recognition via large-scale weak supervision
We study the capabilities of speech processing systems trained simply to predict large
amounts of transcripts of audio on the internet. When scaled to 680,000 hours of multilingual …
ESPnet-SLU: Advancing spoken language understanding through ESPnet
As Automatic Speech Recognition (ASR) systems are getting better, there is an increasing
interest in using the ASR output to do downstream Natural Language Processing (NLP) …
Fast conformer with linearly scalable attention for efficient speech recognition
Conformer-based models have become the dominant end-to-end architecture for speech
processing tasks. With the objective of enhancing the conformer architecture for efficient …
AudioBench: A universal benchmark for audio large language models
We introduce AudioBench, a universal benchmark designed to evaluate Audio Large
Language Models (AudioLLMs). It encompasses 8 distinct tasks and 26 datasets, among …
Less is more: Accurate speech recognition & translation without web-scale data
Recent advances in speech recognition and translation rely on hundreds of thousands of
hours of Internet speech data. We argue that state-of-the-art accuracy can be reached …
A study on the integration of pre-trained SSL, ASR, LM and SLU models for spoken language understanding
Collecting sufficient labeled data for spoken language understanding (SLU) is expensive
and time-consuming. Recent studies achieved promising results by using pre-trained …
VarArray: Array-geometry-agnostic continuous speech separation
Continuous speech separation using a microphone array was shown to be promising in
dealing with the speech overlap problem in natural conversation transcription. This paper …
Token-level sequence labeling for spoken language understanding using compositional end-to-end models
End-to-end spoken language understanding (SLU) systems are gaining popularity over
cascaded approaches due to their simplicity and ability to avoid error propagation. However …
Residual language model for end-to-end speech recognition
End-to-end automatic speech recognition suffers from adaptation to unknown target domain
speech despite being trained with a large amount of paired audio-text data. Recent studies …
Improving contextual recognition of rare words with an alternate spelling prediction model
JD Fox, N Delworth - arXiv preprint arXiv:2209.01250, 2022 - arxiv.org
Contextual ASR, which takes a list of bias terms as input along with audio, has drawn recent
interest as ASR use becomes more widespread. We are releasing contextual biasing lists to …