Word discovery in visually grounded, self-supervised speech models

P Peng, D Harwath - arxiv preprint arxiv:2203.15081, 2022 - arxiv.org
We present a method for visually-grounded spoken term discovery. After training either a
HuBERT or wav2vec2. 0 model to associate spoken captions with natural images, we show …

What do self-supervised speech models know about words?

A Pasad, CM Chien, S Settle, K Livescu - Transactions of the …, 2024 - direct.mit.edu
Many self-supervised speech models (S3Ms) have been introduced over the last few years,
improving performance and data efficiency on various speech tasks. However, these …

Self-supervised language learning from raw audio: Lessons from the zero resource speech challenge

E Dunbar, N Hamilakis… - IEEE Journal of Selected …, 2022 - ieeexplore.ieee.org
Recent progress in self-supervised or unsupervised machine learning has opened the
possibility of building a full speech processing system from raw audio without using any …

Word segmentation on discovered phone units with dynamic programming and self-supervised scoring

H Kamper - IEEE/ACM Transactions on Audio, Speech, and …, 2022 - ieeexplore.ieee.org
Recent work on unsupervised speech segmentation has used self-supervised models with
phone and word segmentation modules that are trained jointly. This paper instead revisits …

What do self-supervised speech models know about words?

A Pasad, CM Chien, S Settle, K Livescu - arxiv preprint arxiv:2307.00162, 2023 - arxiv.org
Many self-supervised speech models (S3Ms) have been introduced over the last few years,
producing performance and data efficiency improvements for a variety of speech tasks …

Variable-rate hierarchical CPC leads to acoustic unit discovery in speech

S Cuervo, A Lancucki, R Marxer… - Advances in …, 2022 - proceedings.neurips.cc
The success of deep learning comes from its ability to capture the hierarchical structure of
data by learning high-level representations defined in terms of low-level ones. In this paper …

DP-Parse: Finding word boundaries from raw speech with an instance lexicon

R Algayres, T Ricoul, J Karadayi… - Transactions of the …, 2022 - direct.mit.edu
Finding word boundaries in continuous speech is challenging as there is little or no
equivalent of a 'space'delimiter between words. Popular Bayesian non-parametric models …

Syllable Discovery and Cross-Lingual Generalization in a Visually Grounded, Self-Supervised Speech Model

P Peng, SW Li, O Räsänen, A Mohamed… - arxiv preprint arxiv …, 2023 - arxiv.org
In this paper, we show that representations capturing syllabic units emerge when training a
self-supervised speech model with a visually-grounded training objective. We demonstrate …

Unsupervised word segmentation using k nearest neighbors

TS Fuchs, Y Hoshen, J Keshet - arxiv preprint arxiv:2204.13094, 2022 - arxiv.org
In this paper, we propose an unsupervised kNN-based approach for word segmentation in
speech utterances. Our method relies on self-supervised pre-trained speech …

Leveraging multilingual transfer for unsupervised semantic acoustic word embeddings

C Jacobs, H Kamper - IEEE Signal Processing Letters, 2023 - ieeexplore.ieee.org
Acoustic word embeddings (AWEs) are fixed-dimensional vector representations of speech
segments that encode phonetic content so that different realisations of the same word have …