Scaling speech technology to 1,000+ languages

V Pratap, A Tjandra, B Shi, P Tomasello, A Babu… - Journal of Machine …, 2024 - jmlr.org
Expanding the language coverage of speech technology has the potential to improve
access to information for many more people. However, current speech technology is …

AudioLM: a language modeling approach to audio generation

Z Borsos, R Marinier, D Vincent… - … ACM transactions on …, 2023 - ieeexplore.ieee.org
We introduce AudioLM, a framework for high-quality audio generation with long-term
consistency. AudioLM maps the input audio to a sequence of discrete tokens and casts …

Unsupervised automatic speech recognition: A review

H Aldarmaki, A Ullah, S Ram, N Zaki - Speech Communication, 2022 - Elsevier
Automatic Speech Recognition (ASR) systems can be trained to achieve
remarkable performance given large amounts of manually transcribed speech, but large …

ContentVec: An improved self-supervised speech representation by disentangling speakers

K Qian, Y Zhang, H Gao, J Ni, CI Lai… - International …, 2022 - proceedings.mlr.press
Self-supervised learning in speech involves training a speech representation network on a
large-scale unannotated speech corpus, and then applying the learned representations to …

Moshi: a speech-text foundation model for real-time dialogue

A Défossez, L Mazaré, M Orsini, A Royer… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce Moshi, a speech-text foundation model and full-duplex spoken dialogue
framework. Current systems for spoken dialogue rely on pipelines of independent …

Self-supervised language learning from raw audio: Lessons from the zero resource speech challenge

E Dunbar, N Hamilakis… - IEEE Journal of Selected …, 2022 - ieeexplore.ieee.org
Recent progress in self-supervised or unsupervised machine learning has opened the
possibility of building a full speech processing system from raw audio without using any …

Analyzing speaker information in self-supervised models to improve zero-resource speech processing

B van Niekerk, L Nortje, M Baas, H Kamper - arXiv preprint arXiv …, 2021 - arxiv.org
Contrastive predictive coding (CPC) aims to learn representations of speech by
distinguishing future observations from a set of negative examples. Previous work has …

Are discrete units necessary for spoken language modeling?

TA Nguyen, B Sagot, E Dupoux - IEEE Journal of Selected …, 2022 - ieeexplore.ieee.org
Recent work in spoken language modeling shows the possibility of learning a language
unsupervisedly from raw audio without any text labels. The approach relies first on …

SpeechGLUE: How well can self-supervised speech models capture linguistic knowledge?

T Ashihara, T Moriya, K Matsuura, T Tanaka… - arXiv preprint arXiv …, 2023 - arxiv.org
Self-supervised learning (SSL) for speech representation has been successfully applied in
various downstream tasks, such as speech and speaker recognition. More recently, speech …

Leveraging the Multilingual Indonesian Ethnic Languages Dataset in Self-Supervised Models for Low-Resource ASR Task

S Sakti, BA Titalim - 2023 IEEE Automatic Speech Recognition …, 2023 - ieeexplore.ieee.org
Indonesia is home to roughly 700 languages, which amounts to about ten percent of the
global total, positioning it as the second-most linguistically diverse country after Papua New …