An overview of voice conversion and its challenges: From statistical modeling to deep learning
Speaker identity is one of the important characteristics of human speech. In voice
conversion, we change the speaker identity from one to another, while keeping the linguistic …
A comprehensive survey on deep music generation: Multi-level representations, algorithms, evaluations, and future directions
S Ji, J Luo, X Yang - arXiv preprint arXiv:2011.06801, 2020 - arxiv.org
The utilization of deep learning techniques in generating various contents (such as image,
text, etc.) has become a trend. Especially music, the topic of this paper, has attracted …
A survey on neural speech synthesis
Text to speech (TTS), or speech synthesis, which aims to synthesize intelligible and natural
speech given text, is a hot research topic in speech, language, and machine learning …
NaturalSpeech: End-to-end text-to-speech synthesis with human-level quality
Text-to-speech (TTS) has made rapid progress in both academia and industry in recent
years. Some questions naturally arise that whether a TTS system can achieve human-level …
SpeechT5: Unified-modal encoder-decoder pre-training for spoken language processing
Motivated by the success of T5 (Text-To-Text Transfer Transformer) in pre-trained natural
language processing models, we propose a unified-modal SpeechT5 framework that …
A comparative study on Transformer vs RNN in speech applications
Sequence-to-sequence models have been widely used in end-to-end speech processing,
for example, automatic speech recognition (ASR), speech translation (ST), and text-to …
Attention, please! A survey of neural attention models in deep learning
In humans, Attention is a core property of all perceptual and cognitive operations. Given our
limited ability to process competing sources, attention mechanisms select, modulate, and …
ESPnet-TTS: Unified, reproducible, and integratable open source end-to-end text-to-speech toolkit
This paper introduces a new end-to-end text-to-speech (E2E-TTS) toolkit named ESPnet-
TTS, which is an extension of the open-source speech processing toolkit ESPnet. The toolkit …
MoGlow: Probabilistic and controllable motion synthesis using normalising flows
Data-driven modelling and synthesis of motion is an active research area with applications
that include animation, games, and social robotics. This paper introduces a new class of …
Towards automatic face-to-face translation
In light of the recent breakthroughs in automatic machine translation systems, we propose a
novel approach that we term as "Face-to-Face Translation". As today's digital communication …