What do self-supervised speech models know about words?
Many self-supervised speech models (S3Ms) have been introduced over the last few years,
improving performance and data efficiency on various speech tasks. However, these …
What do self-supervised speech models know about words?
Many self-supervised speech models (S3Ms) have been introduced over the last few years,
producing performance and data efficiency improvements for a variety of speech tasks …
Mixed children/adult/childrenized fine-tuning for children's ASR: How to reduce age mismatch and speaking style mismatch
Today's end-to-end (E2E) ASR models achieve strong performance when applied to adult
speech, but deteriorate on children's speech. Most E2E ASR models are pre-trained on adult …
Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation
Audio-Visual Speech Recognition (AVSR) uses lip-based video to improve performance in
noise. Since videos are harder to obtain than audio, the video training data of AVSR models …
mWhisper-Flamingo for Multilingual Audio-Visual Noise-Robust Speech Recognition
Audio-Visual Speech Recognition (AVSR) combines lip-based video with audio and can
improve performance in noise, but most methods are trained only on English data. One …
Probing self-supervised learning models with target speech extraction
Large-scale pre-trained self-supervised learning (SSL) models have shown remarkable
advancements in speech-related tasks. However, the utilization of these models in complex …
Interleaved audio/audiovisual transfer learning for AV-ASR in low-resourced languages
Cross-language transfer learning from English to a target language has shown effectiveness
in low-resourced audiovisual speech recognition (AV-ASR). We first investigate a 2-stage …
Leveraging Adapter for Parameter-Efficient ASR Encoder
K Shim, J Lee, H Kim - Proc. Interspeech 2024, 2024 - isca-archive.org
The expansion of speech models emphasizes the importance of parameter efficiency in
practical automatic speech recognition (ASR) systems. Parameter sharing, which reuses the …
CA-MHFA: A Context-Aware Multi-Head Factorized Attentive Pooling for SSL-Based Speaker Verification
Self-supervised learning (SSL) models for speaker verification (SV) have gained significant
attention in recent years. However, existing SSL-based SV systems often struggle to capture …
ConEC: Earnings call dataset with real-world contexts for benchmarking contextual speech recognition
Knowing the particular context associated with a conversation can help improve the
performance of an automatic speech recognition (ASR) system. For example, if we are …