Parallel learning: Overview and perspective for computational learning across Syn2Real and Sim2Real

Q Miao, Y Lv, M Huang, X Wang… - IEEE/CAA Journal of …, 2023 - ieeexplore.ieee.org
The virtual-to-real paradigm, i.e., training models on virtual data and then applying them to
solve real-world problems, has attracted more and more attention from various domains by …

On the adoption of modern technologies to fight the COVID-19 pandemic: a technical synthesis of latest developments

A Majeed, X Zhang - COVID, 2023 - mdpi.com
In the ongoing COVID-19 pandemic, digital technologies have played a vital role to minimize
the spread of COVID-19, and to control its pitfalls for the general public. Without such …

A semi-supervised complementary joint training approach for low-resource speech recognition

YQ Du, J Zhang, X Fang, MH Wu… - IEEE/ACM Transactions …, 2023 - ieeexplore.ieee.org
Both unpaired speech and text have been shown to be beneficial for low-resource automatic
speech recognition (ASR), which, however, were either separately used for pre-training, self …

Generating data with text-to-speech and large-language models for conversational speech recognition

S Cornell, J Darefsky, Z Duan, S Watanabe - arXiv preprint arXiv …, 2024 - arxiv.org
Currently, a common approach in many speech processing tasks is to leverage large scale
pre-trained models by fine-tuning them on in-domain data for a particular application. Yet …

On the Problem of Text-To-Speech Model Selection for Synthetic Data Generation in Automatic Speech Recognition

N Rossenbach, R Schlüter, S Sakti - arXiv preprint arXiv:2407.21476, 2024 - arxiv.org
The rapid development of neural text-to-speech (TTS) systems enabled its usage in other
areas of natural language processing such as automatic speech recognition (ASR) or …

Text is all you need: Personalizing ASR models using controllable speech synthesis

K Yang, TY Hu, JHR Chang… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org
Adapting generic speech recognition models to specific individuals is a challenging problem
due to the scarcity of personalized data. Recent works have proposed boosting the amount …

Phoneme hallucinator: One-shot voice conversion via set expansion

S Shan, Y Li, A Banerjee, JB Oliva - … of the AAAI Conference on Artificial …, 2024 - ojs.aaai.org
Voice conversion (VC) aims at altering a person's voice to make it sound similar to the voice
of another person while preserving linguistic content. Existing methods suffer from a …

Refining Synthesized Speech Using Speaker Information and Phone Masking for Data Augmentation of Speech Recognition

S Ueno, A Lee, T Kawahara - IEEE/ACM Transactions on Audio …, 2024 - ieeexplore.ieee.org
While end-to-end automatic speech recognition (ASR) has shown impressive performance,
it requires a huge amount of speech and transcription data. The conversion of domain …

Can we use Common Voice to train a Multi-Speaker TTS system?

S Ogun, V Colotte, E Vincent - 2022 IEEE Spoken Language …, 2023 - ieeexplore.ieee.org
Training of multi-speaker text-to-speech (TTS) systems relies on curated datasets based on
high-quality recordings or audiobooks. Such datasets often lack speaker diversity and are …

On the effect of purely synthetic training data for different automatic speech recognition architectures

B Hilmes, N Rossenbach - arXiv preprint arXiv:2407.17997, 2024 - arxiv.org
In this work we evaluate the utility of synthetic data for training automatic speech recognition
(ASR). We use the ASR training data to train a text-to-speech (TTS) system similar to …