Self-supervised speech representation learning: A review
Although supervised deep learning has revolutionized speech and audio processing, it has
necessitated the building of specialist models for individual tasks and application scenarios …
HuBERT: Self-supervised speech representation learning by masked prediction of hidden units
Self-supervised approaches for speech representation learning are challenged by three
unique problems: (1) there are multiple sound units in each input utterance, (2) there is no …
On generative spoken language modeling from raw audio
We introduce Generative Spoken Language Modeling, the task of learning the
acoustic and linguistic characteristics of a language from raw audio (no text, no labels), and …
TERA: Self-supervised learning of transformer encoder representation for speech
We introduce a self-supervised speech pre-training method called TERA, which stands for
Transformer Encoder Representations from Alteration. Recent approaches often learn by …
HuBERT: How much can a bad teacher benefit ASR pre-training?
Compared to vision and language applications, self-supervised pre-training approaches for
ASR are challenged by three unique problems: (1) There are multiple sound units in each …
ScanDMM: A deep Markov model of scanpath prediction for 360° images
Scanpath prediction for 360° images aims to produce dynamic gaze behaviors based on
the human visual perception mechanism. Most existing scanpath prediction methods for …
Textless speech emotion conversion using discrete and decomposed representations
Speech emotion conversion is the task of modifying the perceived emotion of a speech
utterance while preserving the lexical content and speaker identity. In this study, we cast the …
SAMU-XLSR: Semantically-aligned multimodal utterance-level cross-lingual speech representation
We propose the SAMU-XLSR: Semantically-Aligned Multimodal Utterance-level Cross-Lingual
Speech Representation learning framework. Unlike previous works on speech representation …
A brief overview of unsupervised neural speech representation learning
Unsupervised representation learning for speech processing has matured greatly in the last
few years. Work in computer vision and natural language processing has paved the way, but …
Aligned contrastive predictive coding
J Chorowski, G Ciesielski, J Dzikowski… - arxiv preprint arxiv …, 2021 - arxiv.org
We investigate the possibility of forcing a self-supervised model trained using a contrastive
predictive loss to extract slowly varying latent representations. Rather than producing …