Speak foreign languages with your own voice: Cross-lingual neural codec language modeling

Z Zhang, L Zhou, C Wang, S Chen, Y Wu, S Liu… - arXiv preprint arXiv …, 2023 - arxiv.org
We propose a cross-lingual neural codec language model, VALL-E X, for cross-lingual
speech synthesis. Specifically, we extend VALL-E and train a multi-lingual conditional codec …

Reproducing Whisper-style training using an open-source toolkit and publicly available data

Y Peng, J Tian, B Yan, D Berrebbi… - 2023 IEEE Automatic …, 2023 - ieeexplore.ieee.org
Pre-training speech models on large volumes of data has achieved remarkable success.
OpenAI Whisper is a multilingual multitask model trained on 680k hours of supervised …

End-to-end speech-to-text translation: A survey

N Sethiya, CK Maurya - Computer Speech & Language, 2024 - Elsevier
Speech-to-Text (ST) translation pertains to the task of converting speech signals in
one language to text in another language. It finds its application in various domains, such as …

M3ST: Mix at Three Levels for Speech Translation

X Cheng, Q Dong, F Yue, T Ko… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org
How to solve the data scarcity problem for end-to-end speech-to-text translation (ST)? It's
well known that data augmentation is an efficient method to improve performance for many …

Findings of the IWSLT 2023 evaluation campaign

M Agarwal, S Agarwal, A Anastasopoulos, L Bentivogli… - 2023 - um.edu.mt
This paper reports on the shared tasks organized by the 20th IWSLT Conference. The
shared tasks address 9 scientific challenges in spoken language translation: simultaneous …

Speech translation with large language models: An industrial practice

Z Huang, R Ye, T Ko, Q Dong, S Cheng… - arXiv preprint arXiv …, 2023 - arxiv.org
Given the great success of large language models (LLMs) across various tasks, in this
paper, we introduce LLM-ST, a novel and effective speech translation model constructed …

Vec-tok speech: speech vectorization and tokenization for neural speech generation

X Zhu, Y Lv, Y Lei, T Li, W He, H Zhou, H Lu… - arXiv preprint arXiv …, 2023 - arxiv.org
Language models (LMs) have recently flourished in natural language processing and
computer vision, generating high-fidelity texts or images in various tasks. In contrast, the …

On the effects of heterogeneous data sources on speech-to-text foundation models

J Tian, Y Peng, W Chen, K Choi, K Livescu… - arXiv preprint arXiv …, 2024 - arxiv.org
The Open Whisper-style Speech Model (OWSM) series was introduced to achieve full
transparency in building advanced speech-to-text (S2T) foundation models. To this end …

LAMASSU: A streaming language-agnostic multilingual speech recognition and translation model using neural transducers

P Wang, E Sun, J Xue, Y Wu, L Zhou, Y Gaur… - Proc …, 2023 - isca-archive.org
Automatic speech recognition (ASR) and speech translation (ST) can both use neural
transducers as the model structure. It is thus possible to use a single transducer model to …

A Comparative Study of Discrete Speech Tokens for Semantic-Related Tasks with Large Language Models

D Wang, M Cui, D Yang, X Chen, H Meng - arXiv preprint arXiv …, 2024 - arxiv.org
With the rise of Speech Large Language Models (Speech LLMs), there has been growing
interest in discrete speech tokens for their ability to integrate with text-based tokens …