Multimodal learning with transformers: A survey

P Xu, X Zhu, DA Clifton - IEEE Transactions on Pattern Analysis …, 2023 - ieeexplore.ieee.org
Transformer is a promising neural network learner, and has achieved great success in
various machine learning tasks. Thanks to the recent prevalence of multimodal applications …

A survey on neural speech synthesis

X Tan, T Qin, F Soong, TY Liu - arxiv preprint arxiv:2106.15561, 2021 - arxiv.org
Text to speech (TTS), or speech synthesis, which aims to synthesize intelligible and natural
speech given text, is a hot research topic in speech, language, and machine learning …

Fastspeech 2: Fast and high-quality end-to-end text to speech

Y Ren, C Hu, X Tan, T Qin, S Zhao, Z Zhao… - arxiv preprint arxiv …, 2020 - arxiv.org
Non-autoregressive text to speech (TTS) models such as FastSpeech can synthesize
speech significantly faster than previous autoregressive models with comparable quality …

Generspeech: Towards style transfer for generalizable out-of-domain text-to-speech

R Huang, Y Ren, J Liu, C Cui… - Advances in Neural …, 2022 - proceedings.neurips.cc
Style transfer for out-of-domain (OOD) speech synthesis aims to generate speech samples
with unseen style (eg, speaker identity, emotion, and prosody) derived from an acoustic …

Meta-stylespeech: Multi-speaker adaptive text-to-speech generation

D Min, DB Lee, E Yang… - … Conference on Machine …, 2021 - proceedings.mlr.press
With rapid progress in neural text-to-speech (TTS) models, personalized speech generation
is now in high demand for many applications. For practical applicability, a TTS model should …

Adaspeech: Adaptive text to speech for custom voice

M Chen, X Tan, B Li, Y Liu, T Qin, S Zhao… - arxiv preprint arxiv …, 2021 - arxiv.org
Custom voice, a specific text to speech (TTS) service in commercial speech platforms, aims
to adapt a source TTS model to synthesize personal voice for a target speaker using few …

Transformers in speech processing: A survey

S Latif, A Zaidi, H Cuayahuitl, F Shamshad… - arxiv preprint arxiv …, 2023 - arxiv.org
The remarkable success of transformers in the field of natural language processing has
sparked the interest of the speech-processing community, leading to an exploration of their …

Multi-singer: Fast multi-singer singing voice vocoder with a large-scale corpus

R Huang, F Chen, Y Ren, J Liu, C Cui… - Proceedings of the 29th …, 2021 - dl.acm.org
High-fidelity multi-singer singing voice synthesis is challenging for neural vocoder due to the
singing voice data shortage, limited singer generalization, and large computational cost …

Adaspeech 4: Adaptive text to speech in zero-shot scenarios

Y Wu, X Tan, B Li, L He, S Zhao, R Song, T Qin… - arxiv preprint arxiv …, 2022 - arxiv.org
Adaptive text to speech (TTS) can synthesize new voices in zero-shot scenarios efficiently,
by using a well-trained source TTS model without adapting it on the speech data of new …

Prompttts 2: Describing and generating voices with text prompt

Y Leng, Z Guo, K Shen, X Tan, Z Ju, Y Liu, Y Liu… - arxiv preprint arxiv …, 2023 - arxiv.org
Speech conveys more information than just text, as the same word can be uttered in various
voices to convey diverse information. Compared to traditional text-to-speech (TTS) methods …