Word discovery in visually grounded, self-supervised speech models
We present a method for visually-grounded spoken term discovery. After training either a
HuBERT or wav2vec2. 0 model to associate spoken captions with natural images, we show …
HuBERT or wav2vec2. 0 model to associate spoken captions with natural images, we show …
What do self-supervised speech models know about words?
Many self-supervised speech models (S3Ms) have been introduced over the last few years,
improving performance and data efficiency on various speech tasks. However, these …
improving performance and data efficiency on various speech tasks. However, these …
Phone-to-audio alignment without text: A semi-supervised approach
The task of phone-to-audio alignment has many applications in speech research. Here we
introduce two Wav2Vec2-based models for both text-dependent and text-independent …
introduce two Wav2Vec2-based models for both text-dependent and text-independent …
Self-supervised language learning from raw audio: Lessons from the zero resource speech challenge
E Dunbar, N Hamilakis… - IEEE Journal of Selected …, 2022 - ieeexplore.ieee.org
Recent progress in self-supervised or unsupervised machine learning has opened the
possibility of building a full speech processing system from raw audio without using any …
possibility of building a full speech processing system from raw audio without using any …
Word segmentation on discovered phone units with dynamic programming and self-supervised scoring
H Kamper - IEEE/ACM Transactions on Audio, Speech, and …, 2022 - ieeexplore.ieee.org
Recent work on unsupervised speech segmentation has used self-supervised models with
phone and word segmentation modules that are trained jointly. This paper instead revisits …
phone and word segmentation modules that are trained jointly. This paper instead revisits …
A brief overview of unsupervised neural speech representation learning
Unsupervised representation learning for speech processing has matured greatly in the last
few years. Work in computer vision and natural language processing has paved the way, but …
few years. Work in computer vision and natural language processing has paved the way, but …
What do self-supervised speech models know about words?
Many self-supervised speech models (S3Ms) have been introduced over the last few years,
producing performance and data efficiency improvements for a variety of speech tasks …
producing performance and data efficiency improvements for a variety of speech tasks …
Variable-rate hierarchical CPC leads to acoustic unit discovery in speech
The success of deep learning comes from its ability to capture the hierarchical structure of
data by learning high-level representations defined in terms of low-level ones. In this paper …
data by learning high-level representations defined in terms of low-level ones. In this paper …
Efficient transformers with dynamic token pooling
Transformers achieve unrivalled performance in modelling language, but remain inefficient
in terms of memory and time complexity. A possible remedy is to reduce the sequence …
in terms of memory and time complexity. A possible remedy is to reduce the sequence …
On compressing sequences for self-supervised speech models
Compressing self-supervised models has become increasingly necessary, as self-
supervised models become larger. While previous approaches have primarily focused on …
supervised models become larger. While previous approaches have primarily focused on …