Self-supervised speech representation learning: A review

A Mohamed, H Lee, L Borgholt… - IEEE Journal of …, 2022 - ieeexplore.ieee.org
Although supervised deep learning has revolutionized speech and audio processing, it has
necessitated the building of specialist models for individual tasks and application scenarios …

Deep spoken keyword spotting: An overview

I López-Espejo, ZH Tan, JHL Hansen, J Jensen - IEEE Access, 2021 - ieeexplore.ieee.org
Spoken keyword spotting (KWS) deals with the identification of keywords in audio streams
and has become a fast-growing technology thanks to the paradigm shift introduced by deep …

Speech2Vec: A sequence-to-sequence framework for learning word embeddings from speech

YA Chung, J Glass - arXiv preprint arXiv:1803.08976, 2018 - arxiv.org
In this paper, we propose a novel deep neural network architecture, Speech2Vec, for
learning fixed-length vector representations of audio segments excised from a speech …

Effectiveness of self-supervised pre-training for speech recognition

A Baevski, M Auli, A Mohamed - arXiv preprint arXiv:1911.03912, 2019 - arxiv.org
We compare self-supervised representation learning algorithms which either explicitly
quantize the audio data or learn representations without quantization. We find the former to …

Audio Word2Vec: Unsupervised learning of audio segment representations using sequence-to-sequence autoencoder

YA Chung, CC Wu, CH Shen, HY Lee… - arXiv preprint arXiv …, 2016 - arxiv.org
The vector representations of fixed dimensionality for words (in text) offered by Word2Vec
have been shown to be very useful in many application scenarios, in particular due to the …

Effectiveness of self-supervised pre-training for ASR

A Baevski, A Mohamed - ICASSP 2020-2020 IEEE International …, 2020 - ieeexplore.ieee.org
We compare self-supervised representation learning algorithms which either explicitly
quantize the audio data or learn representations without quantization. We find the former to …

Query-by-example keyword spotting using long short-term memory networks

G Chen, C Parada, TN Sainath - 2015 IEEE international …, 2015 - ieeexplore.ieee.org
We present a novel approach to query-by-example keyword spotting (KWS) using a long
short-term memory (LSTM) recurrent neural network-based feature extractor. In our …

Deep convolutional acoustic word embeddings using word-pair side information

H Kamper, W Wang, K Livescu - 2016 IEEE International …, 2016 - ieeexplore.ieee.org
Recent studies have been revisiting whole words as the basic modelling unit in speech
recognition and query applications, instead of phonetic units. Such whole-word segmental …

Unsupervised automatic speech recognition: A review

H Aldarmaki, A Ullah, S Ram, N Zaki - Speech Communication, 2022 - Elsevier
Automatic Speech Recognition (ASR) systems can be trained to achieve
remarkable performance given large amounts of manually transcribed speech, but large …

End-to-end ASR-free keyword search from speech

K Audhkhasi, A Rosenberg, A Sethy… - IEEE Journal of …, 2017 - ieeexplore.ieee.org
Conventional keyword search (KWS) systems for speech databases match the input text
query to the set of word hypotheses generated by an automatic speech recognition (ASR) …