Advanced long-content speech recognition with factorized neural transducer

X Gong, Y Wu, J Li, S Liu, R Zhao… - … /ACM Transactions on …, 2024 - ieeexplore.ieee.org
Long-content automatic speech recognition (ASR) has obtained increasing interest in recent
years, as it captures the relationship among consecutive historical utterances while …

Conversational Speech Recognition by Learning Audio-textual Cross-modal Contextual Representation

K Wei, B Li, H Lv, Q Lu, N Jiang… - IEEE/ACM Transactions …, 2024 - ieeexplore.ieee.org
Automatic Speech Recognition (ASR) in conversational settings presents unique
challenges, including extracting relevant contextual information from previous …

Towards effective and compact contextual representation for conformer transducer speech recognition systems

M Cui, J Kang, J Deng, X Yin, Y Xie, X Chen… - arXiv preprint arXiv …, 2023 - arxiv.org
Current ASR systems are mainly trained and evaluated at the utterance level. Long range
cross utterance context can be incorporated. A key task is to derive a suitable compact …

LongFNT: Long-form speech recognition with factorized neural transducer

X Gong, Y Wu, J Li, S Liu, R Zhao… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org
Traditional automatic speech recognition (ASR) systems usually focus on individual
utterances, without considering long-form speech with useful historical information, which is …

Context-aware fine-tuning of self-supervised speech models

S Shon, F Wu, K Kim, P Sridhar… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org
Self-supervised pre-trained transformers have improved the state of the art on a variety of
speech tasks. Due to the quadratic time and space complexity of self-attention, they usually …

Updated Corpora and Benchmarks for Long-Form Speech Recognition

JD Fox, D Raj, N Delworth, Q McNamara… - ICASSP 2024-2024 …, 2024 - ieeexplore.ieee.org
The vast majority of ASR research uses corpora in which both the training and test data have
been pre-segmented into utterances. In most real-world ASR use-cases, however, test audio …

Efficient Long-Form Speech Recognition for General Speech In-Context Learning

H Yen, S Ling, G Ye - arXiv preprint arXiv:2409.19757, 2024 - arxiv.org
We propose a novel approach to end-to-end automatic speech recognition (ASR) to achieve
efficient speech in-context learning (SICL) for (i) long-form speech decoding, (ii) test-time …

Exploring SSL Discrete Speech Features for Zipformer-based Contextual ASR

M Cui, Y Yang, J Deng, J Kang, S Hu, T Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Self-supervised learning (SSL) based discrete speech representations are highly compact
and domain adaptable. In this paper, SSL discrete speech features extracted from WavLM …

Generative Context-Aware Fine-Tuning of Self-Supervised Speech Models

S Shon, K Kim, P Sridhar, YT Hsu… - ICASSP 2024-2024 …, 2024 - ieeexplore.ieee.org
When performing tasks like automatic speech recognition or spoken language
understanding for a given utterance, access to preceding text or audio provides contextual …

4 Cross-Modal Generation of Visual and Auditory

F Gao, M Liu, Y Zhou - Artificial Intelligence for Art Creation and …, 2024 - books.google.com
With the breakthrough progress of generative models in the field of AI painting, AIGC has
attracted widespread attention and become one of the hottest research directions driving the …