AudioLM: A language modeling approach to audio generation

Z Borsos, R Marinier, D Vincent… - … ACM Transactions on …, 2023 - ieeexplore.ieee.org
We introduce AudioLM, a framework for high-quality audio generation with long-term
consistency. AudioLM maps the input audio to a sequence of discrete tokens and casts …

Unsupervised automatic speech recognition: A review

H Aldarmaki, A Ullah, S Ram, N Zaki - Speech Communication, 2022 - Elsevier
Automatic Speech Recognition (ASR) systems can be trained to achieve
remarkable performance given large amounts of manually transcribed speech, but large …

VoxPopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation

C Wang, M Riviere, A Lee, A Wu, C Talnikar… - arXiv preprint arXiv …, 2021 - arxiv.org
We introduce VoxPopuli, a large-scale multilingual corpus providing 100K hours of
unlabelled speech data in 23 languages. It is the largest open data to date for unsupervised …

An unsupervised autoregressive model for speech representation learning

YA Chung, WN Hsu, H Tang, J Glass - arXiv preprint arXiv:1904.03240, 2019 - arxiv.org
This paper proposes a novel unsupervised autoregressive neural model for learning generic
speech representations. In contrast to other speech representation learning methods that …

Unsupervised speech representation learning using wavenet autoencoders

J Chorowski, RJ Weiss, S Bengio… - … /ACM Transactions on …, 2019 - ieeexplore.ieee.org
We consider the task of unsupervised extraction of meaningful latent representations of
speech by applying autoencoding neural networks to speech waveforms. The goal is to …

Unsupervised pretraining transfers well across languages

M Riviere, A Joulin, PE Mazaré… - ICASSP 2020-2020 …, 2020 - ieeexplore.ieee.org
Cross-lingual and multi-lingual training of Automatic Speech Recognition (ASR) has been
extensively investigated in the supervised setting. This assumes the existence of a parallel …

Libri-light: A benchmark for asr with limited or no supervision

J Kahn, M Riviere, W Zheng… - ICASSP 2020-2020 …, 2020 - ieeexplore.ieee.org
We introduce a new collection of spoken English audio suitable for training speech
recognition systems under limited or no supervision. It is derived from open-source audio …

Vector-quantized neural networks for acoustic unit discovery in the ZeroSpeech 2020 challenge

B Van Niekerk, L Nortje, H Kamper - arXiv preprint arXiv:2005.09409, 2020 - arxiv.org
In this paper, we explore vector quantization for acoustic unit discovery. Leveraging
unlabelled data, we aim to learn discrete representations of speech that separate phonetic …

The zero resource speech benchmark 2021: Metrics and baselines for unsupervised spoken language modeling

TA Nguyen, M de Seyssel, P Rozé, M Rivière… - arXiv preprint arXiv …, 2020 - arxiv.org
We introduce a new unsupervised task, spoken language modeling: the learning of linguistic
representations from raw audio signals without any labels, along with the Zero Resource …

The zero resource speech challenge 2017

E Dunbar, XN Cao, J Benjumea… - 2017 IEEE Automatic …, 2017 - ieeexplore.ieee.org
We describe a new challenge aimed at discovering subword and word units from raw
speech. This challenge is the followup to the Zero Resource Speech Challenge 2015. It …