What do self-supervised speech models know about words?

A Pasad, CM Chien, S Settle, K Livescu - Transactions of the …, 2024 - direct.mit.edu
Many self-supervised speech models (S3Ms) have been introduced over the last few years,
improving performance and data efficiency on various speech tasks. However, these …

What do self-supervised speech models know about words?

A Pasad, CM Chien, S Settle, K Livescu - arXiv preprint arXiv:2307.00162, 2023 - arxiv.org
Many self-supervised speech models (S3Ms) have been introduced over the last few years,
producing performance and data efficiency improvements for a variety of speech tasks …

Mixed children/adult/childrenized fine-tuning for children's ASR: How to reduce age mismatch and speaking style mismatch

T Graave, Z Li, T Lohrenz, T Fingscheidt - Proc. Interspeech 2024, 2024 - isca-archive.org
Today's end-to-end (E2E) ASR models achieve strong performance when applied to adult
speech, but deteriorate on children's speech. Most E2E ASR models are pre-trained on adult …

Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation

A Rouditchenko, Y Gong, S Thomas, L Karlinsky… - arXiv preprint arXiv …, 2024 - arxiv.org
Audio-Visual Speech Recognition (AVSR) uses lip-based video to improve performance in
noise. Since videos are harder to obtain than audio, the video training data of AVSR models …

mWhisper-Flamingo for Multilingual Audio-Visual Noise-Robust Speech Recognition

A Rouditchenko, S Bhati, S Thomas, H Kuehne… - arXiv preprint arXiv …, 2025 - arxiv.org
Audio-Visual Speech Recognition (AVSR) combines lip-based video with audio and can
improve performance in noise, but most methods are trained only on English data. One …

Probing self-supervised learning models with target speech extraction

J Peng, M Delcroix, T Ochiai, O Plchot… - … , Speech, and Signal …, 2024 - ieeexplore.ieee.org
Large-scale pre-trained self-supervised learning (SSL) models have shown remarkable
advancements in speech-related tasks. However, the utilization of these models in complex …

Interleaved audio/audiovisual transfer learning for AV-ASR in low-resourced languages

Z Li, P Blumenberg, J Liu, T Graave, T Lohrenz… - 2024 - amazon.science
Cross-language transfer learning from English to a target language has shown effectiveness
in low-resourced audiovisual speech recognition (AV-ASR). We first investigate a 2-stage …

Leveraging Adapter for Parameter-Efficient ASR Encoder

K Shim, J Lee, H Kim - Proc. Interspeech 2024, 2024 - isca-archive.org
The expansion of speech models emphasizes the importance of parameter efficiency in
practical automatic speech recognition (ASR) systems. Parameter sharing, which reuses the …

CA-MHFA: A Context-Aware Multi-Head Factorized Attentive Pooling for SSL-Based Speaker Verification

J Peng, L Mošner, L Zhang, O Plchot… - arXiv preprint arXiv …, 2024 - arxiv.org
Self-supervised learning (SSL) models for speaker verification (SV) have gained significant
attention in recent years. However, existing SSL-based SV systems often struggle to capture …

ConEC: Earnings call dataset with real-world contexts for benchmarking contextual speech recognition

R Huang, M Yarmohammadi, J Trmal… - Proceedings of the …, 2024 - aclanthology.org
Knowing the particular context associated with a conversation can help improve the
performance of an automatic speech recognition (ASR) system. For example, if we are …