Word discovery in visually grounded, self-supervised speech models
We present a method for visually-grounded spoken term discovery. After training either a
HuBERT or wav2vec 2.0 model to associate spoken captions with natural images, we show …
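To illustrate the kind of visually grounded objective this entry describes, below is a minimal sketch of a contrastive loss between pooled speech and image embeddings. The encoders (e.g. HuBERT or wav2vec 2.0 with a pooling head, plus an image encoder), the pooling, and the temperature value are assumptions for illustration, not the paper's exact setup.

```python
import torch
import torch.nn.functional as F

def audio_image_contrastive_loss(audio_emb, image_emb, temperature=0.07):
    """Sketch of a contrastive audio-image matching objective.

    audio_emb, image_emb: (B, D) pooled embeddings of spoken captions and
    their paired images (encoders not shown). Matched caption/image pairs
    share a row index; all other pairs in the batch act as negatives.
    """
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(image_emb, dim=-1)
    logits = a @ v.t() / temperature                      # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    # Symmetric InfoNCE: retrieve the right image for each caption and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```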
What do self-supervised speech models know about words?
Many self-supervised speech models (S3Ms) have been introduced over the last few years,
improving performance and data efficiency on various speech tasks. However, these …
Self-supervised language learning from raw audio: Lessons from the zero resource speech challenge
E Dunbar, N Hamilakis… - IEEE Journal of Selected …, 2022 - ieeexplore.ieee.org
Recent progress in self-supervised or unsupervised machine learning has opened the
possibility of building a full speech processing system from raw audio without using any …
Word segmentation on discovered phone units with dynamic programming and self-supervised scoring
H Kamper - IEEE/ACM Transactions on Audio, Speech, and …, 2022 - ieeexplore.ieee.org
Recent work on unsupervised speech segmentation has used self-supervised models with
phone and word segmentation modules that are trained jointly. This paper instead revisits …
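The entry above describes segmenting speech with dynamic programming over self-supervised scores. The sketch below shows only the generic DP recursion, assuming a user-supplied `segment_cost(frames, s, e)` (lower is more word-like) and a `max_len` cap on segment length; the scoring functions used in the paper are not reproduced here.

```python
import numpy as np

def dp_segment(frames, segment_cost, max_len=30):
    """Pick the segmentation of (T, D) frame features that minimises the
    summed cost of its segments, via dynamic programming."""
    T = len(frames)
    best = np.full(T + 1, np.inf)   # best[e] = cheapest segmentation of frames[:e]
    best[0] = 0.0
    back = np.zeros(T + 1, dtype=int)
    for e in range(1, T + 1):
        for s in range(max(0, e - max_len), e):
            c = best[s] + segment_cost(frames, s, e)
            if c < best[e]:
                best[e], back[e] = c, s
    # Recover boundaries by tracing back-pointers.
    bounds, e = [], T
    while e > 0:
        bounds.append(e)
        e = back[e]
    return sorted(bounds)
```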
Variable-rate hierarchical CPC leads to acoustic unit discovery in speech
The success of deep learning comes from its ability to capture the hierarchical structure of
data by learning high-level representations defined in terms of low-level ones. In this paper …
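For context, here is a minimal single-step CPC-style InfoNCE loss; the hierarchical, variable-rate mechanism that is this paper's contribution is not shown. The `predictor` head (e.g. `torch.nn.Linear(D, D)`), the tensor shapes, and the in-batch negatives are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def cpc_infonce_loss(context, latents, predictor, k=3):
    """Sketch of a CPC objective for one prediction offset k.

    context: (B, T, D) outputs of an autoregressive model over the latents.
    latents: (B, T, D) frame-encoder outputs.
    predictor: head mapping a context vector to a prediction of the latent k steps ahead.
    """
    B, T, D = latents.shape
    pred = predictor(context[:, :T - k]).reshape(-1, D)   # predicted future latents
    target = latents[:, k:].reshape(-1, D)                # true future latents
    logits = pred @ target.t()                            # score each prediction against all targets
    labels = torch.arange(pred.size(0), device=pred.device)
    # InfoNCE: identify the true future latent among the in-batch negatives.
    return F.cross_entropy(logits, labels)
```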
DP-Parse: Finding word boundaries from raw speech with an instance lexicon
R Algayres, T Ricoul, J Karadayi… - Transactions of the …, 2022 - direct.mit.edu
Finding word boundaries in continuous speech is challenging as there is little or no
equivalent of a 'space' delimiter between words. Popular Bayesian non-parametric models …
Syllable Discovery and Cross-Lingual Generalization in a Visually Grounded, Self-Supervised Speech Model
In this paper, we show that representations capturing syllabic units emerge when training a
self-supervised speech model with a visually-grounded training objective. We demonstrate …
Unsupervised word segmentation using k nearest neighbors
In this paper, we propose an unsupervised kNN-based approach for word segmentation in
speech utterances. Our method relies on self-supervised pre-trained speech …
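As a hedged sketch of how a kNN-based segment score might look: mean-pool the self-supervised frame features of a candidate span and score it by the average distance to its k nearest neighbours in a pool of segment embeddings from other utterances, so spans that recur across the corpus get low cost. This is illustrative rather than the paper's exact recipe, and it could plug into a DP segmenter like the one sketched earlier; `pool` and `length_penalty` are assumptions.

```python
import numpy as np

def knn_segment_cost(frames, s, e, pool, k=5, length_penalty=0.5):
    """Score the span frames[s:e] by its average distance to the k nearest
    neighbours in `pool`, an (N, D) array of pooled segment embeddings.
    A per-segment penalty discourages over-segmentation."""
    emb = frames[s:e].mean(axis=0)                 # mean-pooled segment embedding
    d = np.linalg.norm(pool - emb, axis=1)         # distances to all pooled candidates
    return float(np.sort(d)[:k].mean()) + length_penalty
```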
Leveraging multilingual transfer for unsupervised semantic acoustic word embeddings
Acoustic word embeddings (AWEs) are fixed-dimensional vector representations of speech
segments that encode phonetic content so that different realisations of the same word have …
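To make the AWE interface concrete, here is a minimal sketch using mean pooling and cosine similarity; the paper instead learns the embedding function (e.g. via multilingual transfer), so the pooling below is only an illustrative stand-in.

```python
import numpy as np

def acoustic_word_embedding(frames):
    """Map a (T, D) array of frame features for one spoken word segment to a
    fixed-dimensional, L2-normalised vector (a stand-in for a learned AWE)."""
    emb = frames.mean(axis=0)
    return emb / (np.linalg.norm(emb) + 1e-8)

def same_word_score(frames_a, frames_b):
    """Cosine similarity between two AWEs: different realisations of the same
    word should score higher than segments of different words."""
    return float(acoustic_word_embedding(frames_a) @ acoustic_word_embedding(frames_b))
```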