Self-supervised speech representation learning: A review

A Mohamed, H Lee, L Borgholt… - IEEE Journal of …, 2022 - ieeexplore.ieee.org
Although supervised deep learning has revolutionized speech and audio processing, it has
necessitated the building of specialist models for individual tasks and application scenarios …

Layer-wise analysis of a self-supervised speech representation model

A Pasad, JC Chou, K Livescu - 2021 IEEE Automatic Speech …, 2021 - ieeexplore.ieee.org
Recently proposed self-supervised learning approaches have been successful for
pre-training speech representation models. The utility of these learned representations has been …

Unsupervised automatic speech recognition: A review

H Aldarmaki, A Ullah, S Ram, N Zaki - Speech Communication, 2022 - Elsevier
Automatic Speech Recognition (ASR) systems can be trained to achieve
remarkable performance given large amounts of manually transcribed speech, but large …

Medical image classification using synergic deep learning

J Zhang, Y Xie, Q Wu, Y Xia - Medical image analysis, 2019 - Elsevier
The classification of medical images is an essential task in computer-aided diagnosis,
medical image retrieval and mining. Although deep learning has shown proven advantages …

Jointly discovering visual objects and spoken words from raw sensory input

D Harwath, A Recasens, D Surís… - Proceedings of the …, 2018 - openaccess.thecvf.com
In this paper, we explore neural network models that learn to associate segments of spoken
audio captions with the semantically relevant portions of natural images that they refer to …

Effectiveness of self-supervised pre-training for speech recognition

A Baevski, M Auli, A Mohamed - arXiv preprint arXiv:1911.03912, 2019 - arxiv.org
We compare self-supervised representation learning algorithms which either explicitly
quantize the audio data or learn representations without quantization. We find the former to …

Unsupervised learning of semantic audio representations

A Jansen, M Plakal, R Pandya… - … on acoustics, speech …, 2018 - ieeexplore.ieee.org
Even in the absence of any explicit semantic annotation, vast collections of audio recordings
provide valuable information for learning the categorical structure of sounds. We consider …

Pre-training on high-resource speech recognition improves low-resource speech-to-text translation

S Bansal, H Kamper, K Livescu, A Lopez… - arXiv preprint arXiv …, 2018 - arxiv.org
We present a simple approach to improve direct speech-to-text translation (ST) when the
source language is low-resource: we pre-train the model on a high-resource automatic …

The zero resource speech challenge 2015

M Versteegh, R Thiolliere, T Schatz, XN Cao… - Interspeech, 2015 - isca-archive.org
The Interspeech 2015 Zero Resource Speech Challenge aims at discovering
subword and word units from raw speech. The challenge provides the first unified and open …

Deep convolutional acoustic word embeddings using word-pair side information

H Kamper, W Wang, K Livescu - 2016 IEEE International …, 2016 - ieeexplore.ieee.org
Recent studies have been revisiting whole words as the basic modelling unit in speech
recognition and query applications, instead of phonetic units. Such whole-word segmental …