Self-supervised speech representation learning: A review

A Mohamed, H Lee, L Borgholt… - IEEE Journal of …, 2022 - ieeexplore.ieee.org
Although supervised deep learning has revolutionized speech and audio processing, it has
necessitated the building of specialist models for individual tasks and application scenarios …

An overview of voice conversion and its challenges: From statistical modeling to deep learning

B Sisman, J Yamagishi, S King… - IEEE/ACM Transactions …, 2020 - ieeexplore.ieee.org
Speaker identity is one of the important characteristics of human speech. In voice
conversion, we change the speaker identity from one to another, while keeping the linguistic …

Wavlm: Large-scale self-supervised pre-training for full stack speech processing

S Chen, C Wang, Z Chen, Y Wu, S Liu… - IEEE Journal of …, 2022 - ieeexplore.ieee.org
Self-supervised learning (SSL) has achieved great success in speech recognition, while only
limited exploration has been attempted for other speech processing tasks. As speech signal …

Hubert: Self-supervised speech representation learning by masked prediction of hidden units

WN Hsu, B Bolte, YHH Tsai, K Lakhotia… - … ACM Transactions on …, 2021 - ieeexplore.ieee.org
Self-supervised approaches for speech representation learning are challenged by three
unique problems: (1) there are multiple sound units in each input utterance, (2) there is no …

On generative spoken language modeling from raw audio

K Lakhotia, E Kharitonov, WN Hsu, Y Adi… - Transactions of the …, 2021 - direct.mit.edu
We introduce Generative Spoken Language Modeling, the task of learning the
acoustic and linguistic characteristics of a language from raw audio (no text, no labels), and …

Dynamical variational autoencoders: A comprehensive review

L Girin, S Leglaive, X Bie, J Diard, T Hueber… - arXiv preprint arXiv …, 2020 - arxiv.org
Variational autoencoders (VAEs) are powerful deep generative models widely used to
represent high-dimensional complex data through a low-dimensional latent space learned …

An unsupervised autoregressive model for speech representation learning

YA Chung, WN Hsu, H Tang, J Glass - arXiv preprint arXiv:1904.03240, 2019 - arxiv.org
This paper proposes a novel unsupervised autoregressive neural model for learning generic
speech representations. In contrast to other speech representation learning methods that …

HuBERT: How much can a bad teacher benefit ASR pre-training?

WN Hsu, YHH Tsai, B Bolte… - ICASSP 2021-2021 …, 2021 - ieeexplore.ieee.org
Compared to vision and language applications, self-supervised pre-training approaches for
ASR are challenged by three unique problems: (1) There are multiple sound units in each …

Unsupervised learning of disentangled and interpretable representations from sequential data

WN Hsu, Y Zhang, J Glass - Advances in neural information …, 2017 - proceedings.neurips.cc
We present a factorized hierarchical variational autoencoder, which learns disentangled and
interpretable representations from sequential data without supervision. Specifically, we …

Learning latent representations for style control and transfer in end-to-end speech synthesis

YJ Zhang, S Pan, L He, ZH Ling - ICASSP 2019-2019 IEEE …, 2019 - ieeexplore.ieee.org
In this paper, we introduce the Variational Autoencoder (VAE) into an end-to-end speech
synthesis model to learn the latent representation of speaking styles in an unsupervised …