Google USM: Scaling automatic speech recognition beyond 100 languages

Y Zhang, W Han, J Qin, Y Wang, A Bapna… - arXiv preprint arXiv …, 2023 - arxiv.org
We introduce the Universal Speech Model (USM), a single large model that performs
automatic speech recognition (ASR) across 100+ languages. This is achieved by pre …

ONE-PEACE: Exploring one general representation model toward unlimited modalities

P Wang, S Wang, J Lin, S Bai, X Zhou, J Zhou… - arXiv preprint arXiv …, 2023 - arxiv.org
In this work, we explore a scalable way to build a general representation model toward
unlimited modalities. We release ONE-PEACE, a highly extensible model with 4B …

Speak foreign languages with your own voice: Cross-lingual neural codec language modeling

Z Zhang, L Zhou, C Wang, S Chen, Y Wu, S Liu… - arXiv preprint arXiv …, 2023 - arxiv.org
We propose a cross-lingual neural codec language model, VALL-E X, for cross-lingual
speech synthesis. Specifically, we extend VALL-E and train a multi-lingual conditional codec …

SeamlessM4T: Massively Multilingual & Multimodal Machine Translation

L Barrault, YA Chung, MC Meglioli, D Dale… - arXiv preprint arXiv …, 2023 - arxiv.org
What does it take to create the Babel Fish, a tool that can help individuals translate speech
between any two languages? While recent breakthroughs in text-based models have …

Seamless: Multilingual Expressive and Streaming Speech Translation

L Barrault, YA Chung, MC Meglioli, D Dale… - arXiv preprint arXiv …, 2023 - arxiv.org
Large-scale automatic speech translation systems today lack key features that help machine-
mediated communication feel seamless when compared to human-to-human dialogue. In …

SpeechLM: Enhanced speech pre-training with unpaired textual data

Z Zhang, S Chen, L Zhou, Y Wu, S Ren… - … on Audio, Speech …, 2024 - ieeexplore.ieee.org
How to boost speech pre-training with textual data is an unsolved problem, because
speech and text are disparate modalities with distinct characteristics. In this paper …

VatLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning

Q Zhu, L Zhou, Z Zhang, S Liu, B Jiao… - IEEE Transactions …, 2023 - ieeexplore.ieee.org
Although speech is a simple and effective way for humans to communicate with the outside
world, a more realistic speech interaction contains multimodal information, e.g., vision, text …

Mu²SLAM: Multitask, Multilingual Speech and Language Models

Y Cheng, Y Zhang, M Johnson… - International …, 2023 - proceedings.mlr.press
We present Mu²SLAM, a multilingual sequence-to-sequence model pre-trained
jointly on unlabeled speech, unlabeled text and supervised data spanning Automatic …

CMOT: Cross-modal mixup via optimal transport for speech translation

Y Zhou, Q Fang, Y Feng - arXiv preprint arXiv:2305.14635, 2023 - arxiv.org
End-to-end speech translation (ST) is the task of translating speech signals in the source
language into text in the target language. As a cross-modal task, end-to-end ST is difficult to …

DUB: Discrete unit back-translation for speech translation

D Zhang, R Ye, T Ko, M Wang, Y Zhou - arXiv preprint arXiv:2305.11411, 2023 - arxiv.org
How can speech-to-text translation (ST) perform as well as machine translation (MT)? The
key point is to bridge the modality gap between speech and text so that useful MT …