Self-supervised speech representation learning: A review

A Mohamed, H Lee, L Borgholt… - IEEE Journal of …, 2022 - ieeexplore.ieee.org
Although supervised deep learning has revolutionized speech and audio processing, it has
necessitated the building of specialist models for individual tasks and application scenarios …

Unsupervised automatic speech recognition: A review

H Aldarmaki, A Ullah, S Ram, N Zaki - Speech Communication, 2022 - Elsevier
Automatic Speech Recognition (ASR) systems can be trained to achieve
remarkable performance given large amounts of manually transcribed speech, but large …

What do self-supervised speech models know about words?

A Pasad, CM Chien, S Settle, K Livescu - Transactions of the …, 2024 - direct.mit.edu
Many self-supervised speech models (S3Ms) have been introduced over the last few years,
improving performance and data efficiency on various speech tasks. However, these …

Analyzing acoustic word embeddings from pre-trained self-supervised speech models

R Sanabria, H Tang, S Goldwater - ICASSP 2023-2023 IEEE …, 2023 - ieeexplore.ieee.org
Given the strong results of self-supervised models on various tasks, there have been
surprisingly few studies exploring self-supervised representations for acoustic word …

Layer-wise analysis of self-supervised acoustic word embeddings: A study on speech emotion recognition

A Saliba, Y Li, R Sanabria, C Lai - 2024 IEEE International …, 2024 - ieeexplore.ieee.org
The efficacy of self-supervised speech models has been validated, yet the optimal utilization
of their representations remains challenging across diverse tasks. In this study, we delve into …

Generative spoken language model based on continuous word-sized audio tokens

R Algayres, Y Adi, TA Nguyen, J Copet… - arXiv preprint arXiv …, 2023 - arxiv.org
In NLP, text language models based on words or subwords are known to outperform their
character-based counterparts. Yet, in the speech community, the standard input of spoken …

Configurable privacy-preserving automatic speech recognition

R Aloufi, H Haddadi, D Boyle - arXiv preprint arXiv:2104.00766, 2021 - arxiv.org
Voice assistive technologies have given rise to far-reaching privacy and security concerns.
In this paper we investigate whether modular automatic speech recognition (ASR) can …

Self-supervised acoustic word embedding learning via correspondence transformer encoder

J Lin, X Yue, J Ao, H Li - arXiv preprint arXiv:2307.09871, 2023 - arxiv.org
Acoustic word embeddings (AWEs) aim to map a variable-length speech segment into a
fixed-dimensional representation. High-quality AWEs should be invariant to variations, such …

Supervised acoustic embeddings and their transferability across languages

S Ram, H Aldarmaki - arXiv preprint arXiv:2301.01020, 2023 - arxiv.org
In speech recognition, it is essential to model the phonetic content of the input signal while
discarding irrelevant factors such as speaker variations and noise, which is challenging in …

Direct multimodal few-shot learning of speech and images

L Nortje, H Kamper - arXiv preprint arXiv:2012.05680, 2020 - arxiv.org
We propose direct multimodal few-shot models that learn a shared embedding space of
spoken words and images from only a few paired examples. Imagine an agent is shown an …