Word discovery in visually grounded, self-supervised speech models

P Peng, D Harwath - arXiv preprint arXiv:2203.15081, 2022 - arxiv.org
We present a method for visually-grounded spoken term discovery. After training either a
HuBERT or wav2vec2.0 model to associate spoken captions with natural images, we show …

The zero resource speech challenge 2020: Discovering discrete subword and word units

E Dunbar, J Karadayi, M Bernard, XN Cao… - … 2020-Conference of …, 2020 - hal.science
We present the Zero Resource Speech Challenge 2020, which aims at learning speech
representations from raw audio signals without any labels. It combines the data sets and …

Self-supervised language learning from raw audio: Lessons from the zero resource speech challenge

E Dunbar, N Hamilakis… - IEEE Journal of Selected …, 2022 - ieeexplore.ieee.org
Recent progress in self-supervised or unsupervised machine learning has opened the
possibility of building a full speech processing system from raw audio without using any …

Global prosody style transfer without text transcriptions

K Qian, Y Zhang, S Chang, J Xiong… - International …, 2021 - proceedings.mlr.press
Prosody plays an important role in characterizing the style of a speaker or an emotion, but
most non-parallel voice or emotion style transfer algorithms do not convert any prosody …

Word segmentation on discovered phone units with dynamic programming and self-supervised scoring

H Kamper - IEEE/ACM Transactions on Audio, Speech, and …, 2022 - ieeexplore.ieee.org
Recent work on unsupervised speech segmentation has used self-supervised models with
phone and word segmentation modules that are trained jointly. This paper instead revisits …

DP-Parse: Finding word boundaries from raw speech with an instance lexicon

R Algayres, T Ricoul, J Karadayi… - Transactions of the …, 2022 - direct.mit.edu
Finding word boundaries in continuous speech is challenging as there is little or no
equivalent of a 'space' delimiter between words. Popular Bayesian non-parametric models …

A study of bias mitigation strategies for speaker recognition

R Peri, K Somandepalli, S Narayanan - Computer Speech & Language, 2023 - Elsevier
Speaker recognition is increasingly used in several everyday applications including smart
speakers, customer care centers and other speech-driven analytics. It is crucial to accurately …

Syllable Discovery and Cross-Lingual Generalization in a Visually Grounded, Self-Supervised Speech Model

P Peng, SW Li, O Räsänen, A Mohamed… - arXiv preprint arXiv …, 2023 - arxiv.org
In this paper, we show that representations capturing syllabic units emerge when training a
self-supervised speech model with a visually-grounded training objective. We demonstrate …

Spoken-Term Discovery using Discrete Speech Units

B van Niekerk, J Zaïdi, MA Carbonneau… - arXiv preprint arXiv …, 2024 - arxiv.org
Discovering a lexicon from unlabeled audio is a longstanding challenge for zero-resource
speech processing. One approach is to search for frequently occurring patterns in speech …

Slowness Regularized Contrastive Predictive Coding for Acoustic Unit Discovery

S Bhati, J Villalba, P Żelasko… - … on Audio, Speech …, 2024 - ieeexplore.ieee.org
Self-supervised methods such as Contrastive Predictive Coding (CPC) have greatly
improved the quality of the unsupervised representations. These representations …