Self-supervised speech representation learning: A review

A Mohamed, H Lee, L Borgholt… - IEEE Journal of …, 2022 - ieeexplore.ieee.org
Although supervised deep learning has revolutionized speech and audio processing, it has
necessitated the building of specialist models for individual tasks and application scenarios …

SeamlessM4T: Massively Multilingual & Multimodal Machine Translation

L Barrault, YA Chung, MC Meglioli, D Dale… - arXiv preprint arXiv …, 2023 - arxiv.org
What does it take to create the Babel Fish, a tool that can help individuals translate speech
between any two languages? While recent breakthroughs in text-based models have …

Whisper-AT: Noise-robust automatic speech recognizers are also strong general audio event taggers

Y Gong, S Khurana, L Karlinsky, J Glass - arXiv preprint arXiv:2307.03183, 2023 - arxiv.org
In this paper, we focus on Whisper, a recent automatic speech recognition model trained
with a massive 680k-hour labeled speech corpus recorded in diverse conditions. We first …

A joint speech enhancement and self-supervised representation learning framework for noise-robust speech recognition

QS Zhu, J Zhang, ZQ Zhang… - IEEE/ACM Transactions …, 2023 - ieeexplore.ieee.org
Though speech enhancement (SE) can be used to improve speech quality in noisy
environments, it may also cause distortions that degrade the performance of automatic …

How does pre-trained wav2vec 2.0 perform on domain-shifted ASR? An extensive benchmark on air traffic control communications

J Zuluaga-Gomez, A Prasad… - 2022 IEEE Spoken …, 2023 - ieeexplore.ieee.org
Recent work on self-supervised pre-training focuses on leveraging large-scale unlabeled
speech data to build robust end-to-end (E2E) acoustic models (AM) that can later be fine …

Robust data2vec: Noise-robust speech representation learning for ASR by combining regression and improved contrastive learning

QS Zhu, L Zhou, J Zhang, SJ Liu… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org
Self-supervised pre-training methods based on contrastive learning or regression tasks can
utilize more unlabeled data to improve the performance of automatic speech recognition …

Improving distortion robustness of self-supervised speech processing tasks with domain adaptation

KP Huang, YK Fu, Y Zhang, H Lee - arXiv preprint arXiv:2203.16104, 2022 - arxiv.org
Speech distortions are a long-standing problem that degrades the performance of
supervised speech processing models. It is high time that we enhance the …

Gradient remedy for multi-task learning in end-to-end noise-robust speech recognition

Y Hu, C Chen, R Li, Q Zhu… - ICASSP 2023-2023 IEEE …, 2023 - ieeexplore.ieee.org
Speech enhancement (SE) has proved effective in reducing noise from noisy speech signals
for downstream automatic speech recognition (ASR), where a multi-task learning strategy is …

Wav2code: Restore clean speech representations via codebook lookup for noise-robust ASR

Y Hu, C Chen, Q Zhu, ES Chng - IEEE/ACM Transactions on …, 2023 - ieeexplore.ieee.org
Automatic speech recognition (ASR) has achieved remarkable success thanks to recent
advances in deep learning, but it usually degrades significantly under real-world noisy …

De'HuBERT: Disentangling noise in a self-supervised model for robust speech recognition

D Ng, R Zhang, JQ Yip, Z Yang, J Ni… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org
Existing self-supervised pre-trained speech models have offered an effective way to
leverage massive unannotated corpora to build good automatic speech recognition (ASR) …