Google USM: Scaling automatic speech recognition beyond 100 languages
We introduce the Universal Speech Model (USM), a single large model that performs
automatic speech recognition (ASR) across 100+ languages. This is achieved by pre …
ONE-PEACE: Exploring one general representation model toward unlimited modalities
In this work, we explore a scalable way for building a general representation model toward
unlimited modalities. We release ONE-PEACE, a highly extensible model with 4B …
Speak foreign languages with your own voice: Cross-lingual neural codec language modeling
We propose a cross-lingual neural codec language model, VALL-E X, for cross-lingual
speech synthesis. Specifically, we extend VALL-E and train a multi-lingual conditional codec …
SeamlessM4T: Massively Multilingual & Multimodal Machine Translation
What does it take to create the Babel Fish, a tool that can help individuals translate speech
between any two languages? While recent breakthroughs in text-based models have …
Seamless: Multilingual Expressive and Streaming Speech Translation
Large-scale automatic speech translation systems today lack key features that help machine-
mediated communication feel seamless when compared to human-to-human dialogue. In …
SpeechLM: Enhanced speech pre-training with unpaired textual data
How to boost speech pre-training with textual data is an unsolved problem due to the fact
that speech and text are very different modalities with distinct characteristics. In this paper …
VatLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning
Although speech is a simple and effective way for humans to communicate with the outside
world, a more realistic speech interaction contains multimodal information, e.g., vision, text …
Mu$^2$SLAM: Multitask, Multilingual Speech and Language Models
We present Mu$^2$SLAM, a multilingual sequence-to-sequence model pre-trained jointly
on unlabeled speech, unlabeled text and supervised data spanning Automatic …
CMOT: Cross-modal mixup via optimal transport for speech translation
End-to-end speech translation (ST) is the task of translating speech signals in the source
language into text in the target language. As a cross-modal task, end-to-end ST is difficult to …
DUB: Discrete unit back-translation for speech translation
How can speech-to-text translation (ST) perform as well as machine translation (MT)? The
key point is to bridge the modality gap between speech and text so that useful MT …