Self-supervised speech representation learning: A review
Although supervised deep learning has revolutionized speech and audio processing, it has
necessitated the building of specialist models for individual tasks and application scenarios …
necessitated the building of specialist models for individual tasks and application scenarios …
[HTML][HTML] Unsupervised automatic speech recognition: A review
Abstract Automatic Speech Recognition (ASR) systems can be trained to achieve
remarkable performance given large amounts of manually transcribed speech, but large …
remarkable performance given large amounts of manually transcribed speech, but large …
What do self-supervised speech models know about words?
Many self-supervised speech models (S3Ms) have been introduced over the last few years,
improving performance and data efficiency on various speech tasks. However, these …
improving performance and data efficiency on various speech tasks. However, these …
Analyzing acoustic word embeddings from pre-trained self-supervised speech models
Given the strong results of self-supervised models on various tasks, there have been
surprisingly few studies exploring self-supervised representations for acoustic word …
surprisingly few studies exploring self-supervised representations for acoustic word …
Layer-wise analysis of self-supervised acoustic word embeddings: A study on speech emotion recognition
The efficacy of self-supervised speech models has been validated, yet the optimal utilization
of their representations remains challenging across diverse tasks. In this study, we delve into …
of their representations remains challenging across diverse tasks. In this study, we delve into …
Generative spoken language model based on continuous word-sized audio tokens
In NLP, text language models based on words or subwords are known to outperform their
character-based counterparts. Yet, in the speech community, the standard input of spoken …
character-based counterparts. Yet, in the speech community, the standard input of spoken …
Configurable privacy-preserving automatic speech recognition
Voice assistive technologies have given rise to far-reaching privacy and security concerns.
In this paper we investigate whether modular automatic speech recognition (ASR) can …
In this paper we investigate whether modular automatic speech recognition (ASR) can …
Self-supervised acoustic word embedding learning via correspondence transformer encoder
Acoustic word embeddings (AWEs) aims to map a variable-length speech segment into a
fixed-dimensional representation. High-quality AWEs should be invariant to variations, such …
fixed-dimensional representation. High-quality AWEs should be invariant to variations, such …
Supervised acoustic embeddings and their transferability across languages
In speech recognition, it is essential to model the phonetic content of the input signal while
discarding irrelevant factors such as speaker variations and noise, which is challenging in …
discarding irrelevant factors such as speaker variations and noise, which is challenging in …
Direct multimodal few-shot learning of speech and images
We propose direct multimodal few-shot models that learn a shared embedding space of
spoken words and images from only a few paired examples. Imagine an agent is shown an …
spoken words and images from only a few paired examples. Imagine an agent is shown an …