Generative adversarial networks for speech processing: A review
Generative adversarial networks (GANs) have seen remarkable progress in recent years.
They are used as generative models for all kinds of data such as text, images, audio, music …
TTS-by-TTS: TTS-driven data augmentation for fast and high-quality speech synthesis
In this paper, we propose a text-to-speech (TTS)-driven data augmentation method for
improving the quality of a non-autoregressive (non-AR) TTS system. Recently proposed non-AR …
Language model-based emotion prediction methods for emotional speech synthesis systems
This paper proposes an effective emotional text-to-speech (TTS) system with a pre-trained
language model (LM)-based emotion prediction method. Unlike conventional systems that …
Modeling and driving human body soundfields through acoustic primitives
While rendering and animation of photorealistic 3D human body models have matured and
reached an impressive quality over the past years, modeling the spatial audio associated …
N-Singer: A non-autoregressive Korean singing voice synthesis system for pronunciation enhancement
Recently, end-to-end Korean singing voice systems have been designed to generate
realistic singing voices. However, these systems still suffer from a lack of robustness in terms …
High-fidelity and pitch-controllable neural vocoder based on unified source-filter networks
We introduce unified source-filter generative adversarial networks (uSFGAN), a waveform
generative model conditioned on acoustic features, which represents the source-filter …
Improved Parallel WaveGAN vocoder with perceptually weighted spectrogram loss
This paper proposes a spectral-domain perceptual weighting technique for Parallel
WaveGAN-based text-to-speech (TTS) systems. The recently proposed Parallel WaveGAN …
Sounding bodies: modeling 3D spatial sound of humans using body pose and audio
While 3D human body modeling has received much attention in computer vision, modeling
the acoustic equivalent, i.e., modeling 3D spatial audio produced by body motion and speech …
High-Fidelity Parallel WaveGAN with Multi-Band Harmonic-Plus-Noise Model
This paper proposes a multi-band harmonic-plus-noise (HN) Parallel WaveGAN (PWG)
vocoder. To generate a high-fidelity speech signal, it is important to accurately reflect the harmonic …
Phrase break prediction with bidirectional encoder representations in Japanese text-to-speech synthesis
We propose a novel phrase break prediction method that combines implicit features
extracted from a pre-trained large language model, a.k.a. BERT, and explicit features …