Advanced long-content speech recognition with factorized neural transducer
Long-content automatic speech recognition (ASR) has obtained increasing interest in recent
years, as it captures the relationship among consecutive historical utterances while …
Conversational Speech Recognition by Learning Audio-textual Cross-modal Contextual Representation
Automatic Speech Recognition (ASR) in conversational settings presents unique
challenges, including extracting relevant contextual information from previous …
Towards effective and compact contextual representation for conformer transducer speech recognition systems
Current ASR systems are mainly trained and evaluated at the utterance level. Long range
cross utterance context can be incorporated. A key task is to derive a suitable compact …
Longfnt: Long-form speech recognition with factorized neural transducer
Traditional automatic speech recognition (ASR) systems usually focus on individual
utterances, without considering long-form speech with useful historical information, which is …
Context-aware fine-tuning of self-supervised speech models
Self-supervised pre-trained transformers have improved the state of the art on a variety of
speech tasks. Due to the quadratic time and space complexity of self-attention, they usually …
Updated Corpora and Benchmarks for Long-Form Speech Recognition
The vast majority of ASR research uses corpora in which both the training and test data have
been pre-segmented into utterances. In most real-world ASR use-cases, however, test audio …
Efficient Long-Form Speech Recognition for General Speech In-Context Learning
We propose a novel approach to end-to-end automatic speech recognition (ASR) to achieve
efficient speech in-context learning (SICL) for (i) long-form speech decoding, (ii) test-time …
Exploring SSL Discrete Speech Features for Zipformer-based Contextual ASR
Self-supervised learning (SSL) based discrete speech representations are highly compact
and domain adaptable. In this paper, SSL discrete speech features extracted from WavLM …
Generative Context-Aware Fine-Tuning of Self-Supervised Speech Models
When performing tasks like automatic speech recognition or spoken language
understanding for a given utterance, access to preceding text or audio provides contextual …
4 Cross-Modal Generation of Visual and Auditory
F Gao, M Liu, Y Zhou - Artificial Intelligence for Art Creation and …, 2024 - books.google.com
With the breakthrough progress of generative models in the field of AI painting, AIGC has
attracted widespread attention and become one of the hottest research directions driving the …