A review of deep learning techniques for speech processing

A Mehrish, N Majumder, R Bharadwaj, R Mihalcea… - Information …, 2023 - Elsevier
The field of speech processing has undergone a transformative shift with the advent of deep
learning. The use of multiple processing layers has enabled the creation of models capable …

From single- to multi-modal remote sensing imagery interpretation: A survey and taxonomy

X Sun, Y Tian, W Lu, P Wang, R Niu, H Yu… - Science China Information …, 2023 - Springer
Modality is a source or form of information. Through various modal information, humans can
perceive the world from multiple perspectives. Simultaneously, the observation of remote …

Robust speech recognition via large-scale weak supervision

A Radford, JW Kim, T Xu, G Brockman… - International …, 2023 - proceedings.mlr.press
We study the capabilities of speech processing systems trained simply to predict large
amounts of transcripts of audio on the internet. When scaled to 680,000 hours of multilingual …

XLS-R: Self-supervised cross-lingual speech representation learning at scale

A Babu, C Wang, A Tjandra, K Lakhotia, Q Xu… - arXiv preprint arXiv …, 2021 - arxiv.org
This paper presents XLS-R, a large-scale model for cross-lingual speech representation
learning based on wav2vec 2.0. We train models with up to 2B parameters on nearly half a …

HuBERT: Self-supervised speech representation learning by masked prediction of hidden units

WN Hsu, B Bolte, YHH Tsai, K Lakhotia… - … ACM transactions on …, 2021 - ieeexplore.ieee.org
Self-supervised approaches for speech representation learning are challenged by three
unique problems: (1) there are multiple sound units in each input utterance, (2) there is no …

TS2Vec: Towards universal representation of time series

Z Yue, Y Wang, J Duan, T Yang, C Huang… - Proceedings of the …, 2022 - ojs.aaai.org
This paper presents TS2Vec, a universal framework for learning representations of time
series at an arbitrary semantic level. Unlike existing methods, TS2Vec performs contrastive …

Going deeper with image transformers

H Touvron, M Cord, A Sablayrolles… - Proceedings of the …, 2021 - openaccess.thecvf.com
Transformers have been recently adapted for large scale image classification, achieving
high scores shaking up the long supremacy of convolutional neural networks. However the …

W2v-BERT: Combining contrastive learning and masked language modeling for self-supervised speech pre-training

YA Chung, Y Zhang, W Han, CC Chiu… - 2021 IEEE Automatic …, 2021 - ieeexplore.ieee.org
Motivated by the success of masked language modeling (MLM) in pre-training natural
language processing models, we propose w2v-BERT that explores MLM for self-supervised …

FLEURS: Few-shot learning evaluation of universal representations of speech

A Conneau, M Ma, S Khanuja, Y Zhang… - 2022 IEEE Spoken …, 2023 - ieeexplore.ieee.org
We introduce FLEURS, the Few-shot Learning Evaluation of Universal Representations of
Speech benchmark. FLEURS is an n-way parallel speech dataset in 102 languages built on …

Unsupervised speech recognition

A Baevski, WN Hsu, A Conneau… - Advances in Neural …, 2021 - proceedings.neurips.cc
Despite rapid progress in the recent past, current speech recognition systems still require
labeled training data which limits this technology to a small fraction of the languages spoken …