Unsupervised automatic speech recognition: A review

H Aldarmaki, A Ullah, S Ram, N Zaki - Speech Communication, 2022 - Elsevier
Abstract: Automatic Speech Recognition (ASR) systems can be trained to achieve
remarkable performance given large amounts of manually transcribed speech, but large …

Unsupervised learning of spoken language with visual context

D Harwath, A Torralba, J Glass - Advances in neural …, 2016 - proceedings.neurips.cc
Humans learn to speak before they can read or write, so why can't computers do the same?
In this paper, we present a deep neural network model capable of rudimentary spoken …

Jointly discovering visual objects and spoken words from raw sensory input

D Harwath, A Recasens, D Surís… - Proceedings of the …, 2018 - openaccess.thecvf.com
In this paper, we explore neural network models that learn to associate segments of spoken
audio captions with the semantically relevant portions of natural images that they refer to …

Effectiveness of self-supervised pre-training for speech recognition

A Baevski, M Auli, A Mohamed - arXiv preprint arXiv:1911.03912, 2019 - arxiv.org
We compare self-supervised representation learning algorithms which either explicitly
quantize the audio data or learn representations without quantization. We find the former to …

Tied multitask learning for neural speech translation

A Anastasopoulos, D Chiang - arXiv preprint arXiv:1802.06655, 2018 - arxiv.org
We explore multitask models for neural translation of speech, augmenting them in order to
reflect two intuitive notions. First, we introduce a model where the second task decoder …

Pre-training on high-resource speech recognition improves low-resource speech-to-text translation

S Bansal, H Kamper, K Livescu, A Lopez… - arXiv preprint arXiv …, 2018 - arxiv.org
We present a simple approach to improve direct speech-to-text translation (ST) when the
source language is low-resource: we pre-train the model on a high-resource automatic …

Effectiveness of self-supervised pre-training for ASR

A Baevski, A Mohamed - ICASSP 2020-2020 IEEE International …, 2020 - ieeexplore.ieee.org
We compare self-supervised representation learning algorithms which either explicitly
quantize the audio data or learn representations without quantization. We find the former to …

Recent developments in spoken term detection: a survey

A Mandal, KR Prasanna Kumar, P Mitra - International Journal of Speech …, 2014 - Springer
Spoken term detection (STD) provides an efficient means for content based indexing of
speech. However, achieving high detection performance, faster speed, detecting out-of …

Word discovery in visually grounded, self-supervised speech models

P Peng, D Harwath - arXiv preprint arXiv:2203.15081, 2022 - arxiv.org
We present a method for visually-grounded spoken term discovery. After training either a
HuBERT or wav2vec 2.0 model to associate spoken captions with natural images, we show …

Learning hierarchical discrete linguistic units from visually-grounded speech

D Harwath, WN Hsu, J Glass - arXiv preprint arXiv:1911.09602, 2019 - arxiv.org
In this paper, we present a method for learning discrete linguistic units by incorporating
vector quantization layers into neural models of visually grounded speech. We show that our …