A comprehensive survey on pretrained foundation models: A history from BERT to ChatGPT

C Zhou, Q Li, C Li, J Yu, Y Liu, G Wang… - International Journal of …, 2024 - Springer
Abstract Pretrained Foundation Models (PFMs) are regarded as the foundation for various
downstream tasks across different data modalities. A PFM (e.g., BERT, ChatGPT, GPT-4) is …

Conventional and contemporary approaches used in text to speech synthesis: A review

N Kaur, P Singh - Artificial Intelligence Review, 2023 - Springer
Nowadays speech synthesis, or text to speech (TTS), the ability of a system to produce a human-like,
natural-sounding voice from written text, is gaining popularity in the field of speech …

Neural codec language models are zero-shot text to speech synthesizers

C Wang, S Chen, Y Wu, Z Zhang, L Zhou, S Liu… - arXiv preprint arXiv …, 2023 - arxiv.org
We introduce a language modeling approach for text to speech synthesis (TTS). Specifically,
we train a neural codec language model (called VALL-E) using discrete codes derived from …

A survey on neural speech synthesis

X Tan, T Qin, F Soong, TY Liu - arXiv preprint arXiv:2106.15561, 2021 - arxiv.org
Text to speech (TTS), or speech synthesis, which aims to synthesize intelligible and natural
speech given text, is a hot research topic in speech, language, and machine learning …

LibriTTS: A corpus derived from LibriSpeech for text-to-speech

H Zen, V Dang, R Clark, Y Zhang, RJ Weiss… - arXiv preprint arXiv …, 2019 - arxiv.org
This paper introduces a new speech corpus called "LibriTTS" designed for text-to-speech
use. It is derived from the original audio and text materials of the LibriSpeech corpus, which …

An unsupervised autoregressive model for speech representation learning

YA Chung, WN Hsu, H Tang, J Glass - arXiv preprint arXiv:1904.03240, 2019 - arxiv.org
This paper proposes a novel unsupervised autoregressive neural model for learning generic
speech representations. In contrast to other speech representation learning methods that …

Generative pre-training for speech with autoregressive predictive coding

YA Chung, J Glass - ICASSP 2020-2020 IEEE International …, 2020 - ieeexplore.ieee.org
Learning meaningful and general representations from unannotated speech that are
applicable to a wide range of tasks remains challenging. In this paper we propose to use …

A review of deep learning based speech synthesis

Y Ning, S He, Z Wu, C Xing, LJ Zhang - Applied Sciences, 2019 - mdpi.com
Speech synthesis, also known as text-to-speech (TTS), has attracted increasing attention.
Recent advances in speech synthesis are overwhelmingly contributed by deep …

Fully-hierarchical fine-grained prosody modeling for interpretable speech synthesis

G Sun, Y Zhang, RJ Weiss, Y Cao… - ICASSP 2020-2020 …, 2020 - ieeexplore.ieee.org
This paper proposes a hierarchical, fine-grained and interpretable latent variable model for
prosody based on the Tacotron 2 text-to-speech model. It achieves multi-resolution …

LRSpeech: Extremely low-resource speech synthesis and recognition

J Xu, X Tan, Y Ren, T Qin, J Li, S Zhao… - Proceedings of the 26th …, 2020 - dl.acm.org
Speech synthesis (text to speech, TTS) and recognition (automatic speech recognition, ASR)
are important speech tasks, and they require large amounts of paired text and speech data for model …