Self-supervised speech representation learning: A review

A Mohamed, H Lee, L Borgholt… - IEEE Journal of …, 2022‏ - ieeexplore.ieee.org
Although supervised deep learning has revolutionized speech and audio processing, it has
necessitated the building of specialist models for individual tasks and application scenarios …

Deepfakes generation and detection: State-of-the-art, open challenges, countermeasures, and way forward

M Masood, M Nawaz, KM Malik, A Javed, A Irtaza… - Applied …, 2023‏ - Springer
Easy access to audio-visual content on social media, combined with the availability of
modern tools such as Tensorflow or Keras, and open-source trained models, along with …

Unsupervised learning of semantic audio representations

A Jansen, M Plakal, R Pandya… - … on acoustics, speech …, 2018‏ - ieeexplore.ieee.org
Even in the absence of any explicit semantic annotation, vast collections of audio recordings
provide valuable information for learning the categorical structure of sounds. We consider …

The zero resource speech challenge 2017

E Dunbar, XN Cao, J Benjumea… - 2017 IEEE Automatic …, 2017‏ - ieeexplore.ieee.org
We describe a new challenge aimed at discovering subword and word units from raw
speech. This challenge is the followup to the Zero Resource Speech Challenge 2015. It …

Cognitive science in the era of artificial intelligence: A roadmap for reverse-engineering the infant language-learner

E Dupoux - Cognition, 2018‏ - Elsevier
Spectacular progress in the information processing sciences (machine learning, wearable
sensors) promises to revolutionize the study of cognitive development. Here, we analyse the …

Word discovery in visually grounded, self-supervised speech models

P Peng, D Harwath - arxiv preprint arxiv:2203.15081, 2022‏ - arxiv.org
We present a method for visually-grounded spoken term discovery. After training either a
HuBERT or wav2vec2. 0 model to associate spoken captions with natural images, we show …

Learning hierarchical discrete linguistic units from visually-grounded speech

D Harwath, WN Hsu, J Glass - arxiv preprint arxiv:1911.09602, 2019‏ - arxiv.org
In this paper, we present a method for learning discrete linguistic units by incorporating
vector quantization layers into neural models of visually grounded speech. We show that our …

[HTML][HTML] Unsupervised automatic speech recognition: A review

H Aldarmaki, A Ullah, S Ram, N Zaki - Speech Communication, 2022‏ - Elsevier
Abstract Automatic Speech Recognition (ASR) systems can be trained to achieve
remarkable performance given large amounts of manually transcribed speech, but large …

A segmental framework for fully-unsupervised large-vocabulary speech recognition

H Kamper, A Jansen, S Goldwater - Computer Speech & Language, 2017‏ - Elsevier
Zero-resource speech technology is a growing research area that aims to develop methods
for speech processing in the absence of transcriptions, lexicons, or language modelling text …

VQVAE unsupervised unit discovery and multi-scale code2spec inverter for zerospeech challenge 2019

A Tjandra, B Sisman, M Zhang, S Sakti, H Li… - arxiv preprint arxiv …, 2019‏ - arxiv.org
We describe our submitted system for the ZeroSpeech Challenge 2019. The current
challenge theme addresses the difficulty of constructing a speech synthesizer without any …