Word discovery in visually grounded, self-supervised speech models
We present a method for visually-grounded spoken term discovery. After training either a
HuBERT or wav2vec2.0 model to associate spoken captions with natural images, we show …
The zero resource speech challenge 2020: Discovering discrete subword and word units
We present the Zero Resource Speech Challenge 2020, which aims at learning speech
representations from raw audio signals without any labels. It combines the data sets and …
Self-supervised language learning from raw audio: Lessons from the zero resource speech challenge
E Dunbar, N Hamilakis… - IEEE Journal of Selected …, 2022 - ieeexplore.ieee.org
Recent progress in self-supervised or unsupervised machine learning has opened the
possibility of building a full speech processing system from raw audio without using any …
Global prosody style transfer without text transcriptions
Prosody plays an important role in characterizing the style of a speaker or an emotion, but
most non-parallel voice or emotion style transfer algorithms do not convert any prosody …
Word segmentation on discovered phone units with dynamic programming and self-supervised scoring
H Kamper - IEEE/ACM Transactions on Audio, Speech, and …, 2022 - ieeexplore.ieee.org
Recent work on unsupervised speech segmentation has used self-supervised models with
phone and word segmentation modules that are trained jointly. This paper instead revisits …
DP-Parse: Finding word boundaries from raw speech with an instance lexicon
R Algayres, T Ricoul, J Karadayi… - Transactions of the …, 2022 - direct.mit.edu
Finding word boundaries in continuous speech is challenging as there is little or no
equivalent of a 'space' delimiter between words. Popular Bayesian non-parametric models …
A study of bias mitigation strategies for speaker recognition
Speaker recognition is increasingly used in several everyday applications including smart
speakers, customer care centers and other speech-driven analytics. It is crucial to accurately …
Syllable Discovery and Cross-Lingual Generalization in a Visually Grounded, Self-Supervised Speech Model
In this paper, we show that representations capturing syllabic units emerge when training a
self-supervised speech model with a visually-grounded training objective. We demonstrate …
Spoken-Term Discovery using Discrete Speech Units
Discovering a lexicon from unlabeled audio is a longstanding challenge for zero-resource
speech processing. One approach is to search for frequently occurring patterns in speech …
Slowness Regularized Contrastive Predictive Coding for Acoustic Unit Discovery
Self-supervised methods such as Contrastive Predictive Coding (CPC) have greatly
improved the quality of the unsupervised representations. These representations …