Multimodal learning with transformers: A survey
Transformer is a promising neural network learner, and has achieved great success in
various machine learning tasks. Thanks to the recent prevalence of multimodal applications …
various machine learning tasks. Thanks to the recent prevalence of multimodal applications …
A survey on neural speech synthesis
Text to speech (TTS), or speech synthesis, which aims to synthesize intelligible and natural
speech given text, is a hot research topic in speech, language, and machine learning …
speech given text, is a hot research topic in speech, language, and machine learning …
Fastspeech 2: Fast and high-quality end-to-end text to speech
Non-autoregressive text to speech (TTS) models such as FastSpeech can synthesize
speech significantly faster than previous autoregressive models with comparable quality …
speech significantly faster than previous autoregressive models with comparable quality …
Generspeech: Towards style transfer for generalizable out-of-domain text-to-speech
Style transfer for out-of-domain (OOD) speech synthesis aims to generate speech samples
with unseen style (eg, speaker identity, emotion, and prosody) derived from an acoustic …
with unseen style (eg, speaker identity, emotion, and prosody) derived from an acoustic …
Meta-stylespeech: Multi-speaker adaptive text-to-speech generation
With rapid progress in neural text-to-speech (TTS) models, personalized speech generation
is now in high demand for many applications. For practical applicability, a TTS model should …
is now in high demand for many applications. For practical applicability, a TTS model should …
Adaspeech: Adaptive text to speech for custom voice
Custom voice, a specific text to speech (TTS) service in commercial speech platforms, aims
to adapt a source TTS model to synthesize personal voice for a target speaker using few …
to adapt a source TTS model to synthesize personal voice for a target speaker using few …
Transformers in speech processing: A survey
The remarkable success of transformers in the field of natural language processing has
sparked the interest of the speech-processing community, leading to an exploration of their …
sparked the interest of the speech-processing community, leading to an exploration of their …
Multi-singer: Fast multi-singer singing voice vocoder with a large-scale corpus
High-fidelity multi-singer singing voice synthesis is challenging for neural vocoder due to the
singing voice data shortage, limited singer generalization, and large computational cost …
singing voice data shortage, limited singer generalization, and large computational cost …
Adaspeech 4: Adaptive text to speech in zero-shot scenarios
Adaptive text to speech (TTS) can synthesize new voices in zero-shot scenarios efficiently,
by using a well-trained source TTS model without adapting it on the speech data of new …
by using a well-trained source TTS model without adapting it on the speech data of new …
Prompttts 2: Describing and generating voices with text prompt
Speech conveys more information than just text, as the same word can be uttered in various
voices to convey diverse information. Compared to traditional text-to-speech (TTS) methods …
voices to convey diverse information. Compared to traditional text-to-speech (TTS) methods …