Voicebox: Text-guided multilingual universal speech generation at scale

M Le, A Vyas, B Shi, B Karrer, L Sari… - Advances in neural …, 2023 - proceedings.neurips.cc
Large-scale generative models such as GPT and DALL-E have revolutionized the research
community. These models not only generate high fidelity outputs, but are also generalists …

SeamlessM4T: Massively Multilingual & Multimodal Machine Translation

L Barrault, YA Chung, MC Meglioli, D Dale… - arXiv preprint arXiv …, 2023 - arxiv.org
What does it take to create the Babel Fish, a tool that can help individuals translate speech
between any two languages? While recent breakthroughs in text-based models have …

Textually pretrained speech language models

M Hassid, T Remez, TA Nguyen, I Gat… - Advances in …, 2023 - proceedings.neurips.cc
Speech language models (SpeechLMs) process and generate acoustic data only, without
textual supervision. In this work, we propose TWIST, a method for training SpeechLMs using …

Textless speech-to-speech translation on real data

A Lee, H Gong, PA Duquenne, H Schwenk… - arXiv preprint arXiv …, 2021 - arxiv.org
We present a textless speech-to-speech translation (S2ST) system that can translate speech
from one language into another language and can be built without the need of any text data …

ESPnet2-TTS: Extending the edge of TTS research

T Hayashi, R Yamamoto, T Yoshimura, P Wu… - arXiv preprint arXiv …, 2021 - arxiv.org
This paper describes ESPnet2-TTS, an end-to-end text-to-speech (E2E-TTS) toolkit.
ESPnet2-TTS extends our earlier version, ESPnet-TTS, by adding many new features …

Improving grammatical error correction with multimodal feature integration

T Fang, J Hu, DF Wong, X Wan, LS Chao… - Findings of the …, 2023 - aclanthology.org
Grammatical error correction (GEC) is a promising task aimed at correcting errors in a text.
Many methods have been proposed to facilitate this task with remarkable results. However …

Speaking style conversion in the waveform domain using discrete self-supervised units

G Maimon, Y Adi - arXiv preprint arXiv:2212.09730, 2022 - arxiv.org
We introduce DISSC, a novel, lightweight method that converts the rhythm, pitch contour and
timbre of a recording to a target speaker in a textless manner. Unlike DISSC, most voice …

Phonetic analysis of self-supervised representations of English speech

D Wells, H Tang, K Richmond - 23rd Annual Conference of the …, 2022 - research.ed.ac.uk
We present an analysis of discrete units discovered via self-supervised representation
learning on English speech. We focus on units produced by a pre-trained HuBERT model …

A holistic cascade system, benchmark, and human evaluation protocol for expressive speech-to-speech translation

WC Huang, B Peloquin, J Kao, C Wang… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org
Expressive speech-to-speech translation (S2ST) aims to transfer prosodic attributes of
source speech to target speech while maintaining translation accuracy. Existing research in …

Scaling properties of speech language models

S Cuervo, R Marxer - arXiv preprint arXiv:2404.00685, 2024 - arxiv.org
Speech Language Models (SLMs) aim to learn language from raw audio, without textual
resources. Despite significant advances, our current models exhibit weak syntax and …