Google USM: Scaling automatic speech recognition beyond 100 languages

Y Zhang, W Han, J Qin, Y Wang, A Bapna… - arXiv preprint arXiv …, 2023 - arxiv.org
We introduce the Universal Speech Model (USM), a single large model that performs
automatic speech recognition (ASR) across 100+ languages. This is achieved by pre …

ONE-PEACE: Exploring one general representation model toward unlimited modalities

P Wang, S Wang, J Lin, S Bai, X Zhou, J Zhou… - arXiv preprint arXiv …, 2023 - arxiv.org
In this work, we explore a scalable way to build a general representation model toward
unlimited modalities. We release ONE-PEACE, a highly extensible model with 4B …

Speak foreign languages with your own voice: Cross-lingual neural codec language modeling

Z Zhang, L Zhou, C Wang, S Chen, Y Wu, S Liu… - arXiv preprint arXiv …, 2023 - arxiv.org
We propose a cross-lingual neural codec language model, VALL-E X, for cross-lingual
speech synthesis. Specifically, we extend VALL-E and train a multi-lingual conditional codec …

SeamlessM4T: Massively Multilingual & Multimodal Machine Translation

L Barrault, YA Chung, MC Meglioli, D Dale… - arXiv preprint arXiv …, 2023 - arxiv.org
What does it take to create the Babel Fish, a tool that can help individuals translate speech
between any two languages? While recent breakthroughs in text-based models have …

Seamless: Multilingual Expressive and Streaming Speech Translation

L Barrault, YA Chung, MC Meglioli, D Dale… - arXiv preprint arXiv …, 2023 - arxiv.org
Large-scale automatic speech translation systems today lack key features that help machine-
mediated communication feel seamless when compared to human-to-human dialogue. In …

SpeechLM: Enhanced speech pre-training with unpaired textual data

Z Zhang, S Chen, L Zhou, Y Wu, S Ren… - … on Audio, Speech …, 2024 - ieeexplore.ieee.org
How to boost speech pre-training with textual data is an unsolved problem, because
speech and text are disparate modalities with distinct characteristics. In this paper …

VatLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning

Q Zhu, L Zhou, Z Zhang, S Liu, B Jiao… - IEEE Transactions …, 2023 - ieeexplore.ieee.org
Although speech is a simple and effective way for humans to communicate with the outside
world, a more realistic speech interaction contains multimodal information, e.g., vision, text …

Mu²SLAM: Multitask, Multilingual Speech and Language Models

Y Cheng, Y Zhang, M Johnson… - International …, 2023 - proceedings.mlr.press
We present Mu²SLAM, a multilingual sequence-to-sequence model pre-trained
jointly on unlabeled speech, unlabeled text and supervised data spanning Automatic …

CMOT: Cross-modal mixup via optimal transport for speech translation

Y Zhou, Q Fang, Y Feng - arXiv preprint arXiv:2305.14635, 2023 - arxiv.org
End-to-end speech translation (ST) is the task of translating speech signals in the source
language into text in the target language. As a cross-modal task, end-to-end ST is difficult to …

DUB: Discrete unit back-translation for speech translation

D Zhang, R Ye, T Ko, M Wang, Y Zhou - arXiv preprint arXiv:2305.11411, 2023 - arxiv.org
How can speech-to-text translation (ST) perform as well as machine translation (MT)? The
key point is to bridge the modality gap between speech and text so that useful MT …