Self-supervised speech representation learning: A review

A Mohamed, H Lee, L Borgholt… - IEEE Journal of …, 2022 - ieeexplore.ieee.org
Although supervised deep learning has revolutionized speech and audio processing, it has
necessitated the building of specialist models for individual tasks and application scenarios …

HuBERT: Self-supervised speech representation learning by masked prediction of hidden units

WN Hsu, B Bolte, YHH Tsai, K Lakhotia… - … ACM transactions on …, 2021 - ieeexplore.ieee.org
Self-supervised approaches for speech representation learning are challenged by three
unique problems: (1) there are multiple sound units in each input utterance, (2) there is no …

On generative spoken language modeling from raw audio

K Lakhotia, E Kharitonov, WN Hsu, Y Adi… - Transactions of the …, 2021 - direct.mit.edu
We introduce Generative Spoken Language Modeling, the task of learning the
acoustic and linguistic characteristics of a language from raw audio (no text, no labels), and …

TERA: Self-supervised learning of transformer encoder representation for speech

AT Liu, SW Li, H Lee - IEEE/ACM Transactions on Audio …, 2021 - ieeexplore.ieee.org
We introduce a self-supervised speech pre-training method called TERA, which stands for
Transformer Encoder Representations from Alteration. Recent approaches often learn by …

HuBERT: How much can a bad teacher benefit ASR pre-training?

WN Hsu, YHH Tsai, B Bolte… - ICASSP 2021-2021 …, 2021 - ieeexplore.ieee.org
Compared to vision and language applications, self-supervised pre-training approaches for
ASR are challenged by three unique problems: (1) There are multiple sound units in each …

ScanDMM: A deep markov model of scanpath prediction for 360° images

X Sui, Y Fang, H Zhu, S Wang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Scanpath prediction for 360° images aims to produce dynamic gaze behaviors based on
the human visual perception mechanism. Most existing scanpath prediction methods for …

Textless speech emotion conversion using discrete and decomposed representations

F Kreuk, A Polyak, J Copet, E Kharitonov… - arXiv preprint arXiv …, 2021 - arxiv.org
Speech emotion conversion is the task of modifying the perceived emotion of a speech
utterance while preserving the lexical content and speaker identity. In this study, we cast the …

SAMU-XLSR: Semantically-aligned multimodal utterance-level cross-lingual speech representation

S Khurana, A Laurent, J Glass - IEEE Journal of Selected …, 2022 - ieeexplore.ieee.org
We propose the SAMU-XLSR: Semantically-Aligned Multimodal Utterance-level Cross-Lingual
Speech Representation learning framework. Unlike previous works on speech representation …

A brief overview of unsupervised neural speech representation learning

L Borgholt, JD Havtorn, J Edin, L Maaløe… - arXiv preprint arXiv …, 2022 - arxiv.org
Unsupervised representation learning for speech processing has matured greatly in the last
few years. Work in computer vision and natural language processing has paved the way, but …

Aligned contrastive predictive coding

J Chorowski, G Ciesielski, J Dzikowski… - arXiv preprint arXiv …, 2021 - arxiv.org
We investigate the possibility of forcing a self-supervised model trained using a contrastive
predictive loss to extract slowly varying latent representations. Rather than producing …