A review of deep learning techniques for speech processing
The field of speech processing has undergone a transformative shift with the advent of deep
learning. The use of multiple processing layers has enabled the creation of models capable …
From single- to multi-modal remote sensing imagery interpretation: A survey and taxonomy
Modality is a source or form of information. Through various modal information, humans can
perceive the world from multiple perspectives. Simultaneously, the observation of remote …
Robust speech recognition via large-scale weak supervision
We study the capabilities of speech processing systems trained simply to predict large
amounts of transcripts of audio on the internet. When scaled to 680,000 hours of multilingual …
XLS-R: Self-supervised cross-lingual speech representation learning at scale
This paper presents XLS-R, a large-scale model for cross-lingual speech representation
learning based on wav2vec 2.0. We train models with up to 2B parameters on nearly half a …
HuBERT: Self-supervised speech representation learning by masked prediction of hidden units
Self-supervised approaches for speech representation learning are challenged by three
unique problems: (1) there are multiple sound units in each input utterance, (2) there is no …
TS2Vec: Towards universal representation of time series
This paper presents TS2Vec, a universal framework for learning representations of time
series at an arbitrary semantic level. Unlike existing methods, TS2Vec performs contrastive …
Going deeper with image transformers
Transformers have been recently adapted for large scale image classification, achieving
high scores, shaking up the long supremacy of convolutional neural networks. However, the …
w2v-BERT: Combining contrastive learning and masked language modeling for self-supervised speech pre-training
Motivated by the success of masked language modeling (MLM) in pre-training natural
language processing models, we propose w2v-BERT that explores MLM for self-supervised …
FLEURS: Few-shot learning evaluation of universal representations of speech
We introduce FLEURS, the Few-shot Learning Evaluation of Universal Representations of
Speech benchmark. FLEURS is an n-way parallel speech dataset in 102 languages built on …
Unsupervised speech recognition
Despite rapid progress in the recent past, current speech recognition systems still require
labeled training data, which limits this technology to a small fraction of the languages spoken …