Maestro: Matched speech text representations through modality matching

Z Chen, Y Zhang, A Rosenberg… - arXiv preprint arXiv …, 2022 - arxiv.org
We present Maestro, a self-supervised training method to unify representations learnt from
speech and text modalities. Self-supervised learning from speech signals aims to learn the …

DUB: Discrete unit back-translation for speech translation

D Zhang, R Ye, T Ko, M Wang, Y Zhou - arXiv preprint arXiv:2305.11411, 2023 - arxiv.org
How can speech-to-text translation (ST) perform as well as machine translation (MT)? The
key point is to bridge the modality gap between speech and text so that useful MT …

Leveraging large text corpora for end-to-end speech summarization

K Matsuura, T Ashihara, T Moriya… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org
End-to-end speech summarization (E2E SSum) is a technique to directly generate summary
sentences from speech. Compared with the cascade approach, which combines automatic …

Generating data with text-to-speech and large-language models for conversational speech recognition

S Cornell, J Darefsky, Z Duan, S Watanabe - arXiv preprint arXiv …, 2024 - arxiv.org
Currently, a common approach in many speech processing tasks is to leverage large scale
pre-trained models by fine-tuning them on in-domain data for a particular application. Yet …

Text-only domain adaptation for end-to-end ASR using integrated text-to-mel-spectrogram generator

V Bataev, R Korostik, E Shabalin, V Lavrukhin… - arXiv preprint arXiv …, 2023 - arxiv.org
We propose an end-to-end Automatic Speech Recognition (ASR) system that can be trained
on transcribed speech data, text-only data, or a mixture of both. The proposed model uses …

On the effect of purely synthetic training data for different automatic speech recognition architectures

B Hilmes, N Rossenbach - arXiv preprint arXiv:2407.17997, 2024 - arxiv.org
In this work we evaluate the utility of synthetic data for training automatic speech recognition
(ASR). We use the ASR training data to train a text-to-speech (TTS) system similar to …

When Whisper meets TTS: Domain adaptation using only synthetic speech data

JC Vásquez-Correa, H Arzelus… - … Conference on Text …, 2023 - Springer
Automatic Speech Recognition is among the most important areas of Artificial
Intelligence research today. One of the most notable advances in this area is the …

Investigating phoneme similarity with artificially accented speech

M Masson, J Carson-Berndsen - Proceedings of the 20th …, 2023 - aclanthology.org
While the deep learning revolution has led to significant performance improvements in
speech recognition, accented speech remains a challenge. Current approaches to this …

Towards Selection of Text-to-speech Data to Augment ASR Training

S Liu, L Sarı, C Wu, G Keren, Y Shangguan… - arXiv preprint arXiv …, 2023 - arxiv.org
This paper presents a method for selecting appropriate synthetic speech samples from a
given large text-to-speech (TTS) dataset as supplementary training data for an automatic …

Enhancing Automatic Speech Recognition: Effects of Semantic Audio Filtering on Models Performance

Y Perezhohin, T Santos, V Costa, F Peres… - IEEE …, 2024 - ieeexplore.ieee.org
This paper presents a novel methodology for enhancing Automatic Speech Recognition
(ASR) performance by utilizing contrastive learning to filter synthetic audio data. We address …