Unsupervised automatic speech recognition: A review
Automatic Speech Recognition (ASR) systems can be trained to achieve
remarkable performance given large amounts of manually transcribed speech, but large …
Unsupervised learning of spoken language with visual context
Humans learn to speak before they can read or write, so why can't computers do the same?
In this paper, we present a deep neural network model capable of rudimentary spoken …
Jointly discovering visual objects and spoken words from raw sensory input
In this paper, we explore neural network models that learn to associate segments of spoken
audio captions with the semantically relevant portions of natural images that they refer to …
Effectiveness of self-supervised pre-training for speech recognition
We compare self-supervised representation learning algorithms which either explicitly
quantize the audio data or learn representations without quantization. We find the former to …
Tied multitask learning for neural speech translation
We explore multitask models for neural translation of speech, augmenting them in order to
reflect two intuitive notions. First, we introduce a model where the second task decoder …
Pre-training on high-resource speech recognition improves low-resource speech-to-text translation
We present a simple approach to improve direct speech-to-text translation (ST) when the
source language is low-resource: we pre-train the model on a high-resource automatic …
Recent developments in spoken term detection: a survey
A Mandal, KR Prasanna Kumar, P Mitra - International Journal of Speech …, 2014 - Springer
Spoken term detection (STD) provides an efficient means for content based indexing of
speech. However, achieving high detection performance, faster speed, detecting out-of …
Word discovery in visually grounded, self-supervised speech models
We present a method for visually-grounded spoken term discovery. After training either a
HuBERT or wav2vec 2.0 model to associate spoken captions with natural images, we show …
Learning hierarchical discrete linguistic units from visually-grounded speech
In this paper, we present a method for learning discrete linguistic units by incorporating
vector quantization layers into neural models of visually grounded speech. We show that our …