Generative adversarial networks for speech processing: A review

A Wali, Z Alamgir, S Karim, A Fawaz, MB Ali… - Computer Speech & …, 2022 - Elsevier
Generative adversarial networks (GANs) have seen remarkable progress in recent years.
They are used as generative models for all kinds of data such as text, images, audio, music …

TTS-by-TTS: TTS-driven data augmentation for fast and high-quality speech synthesis

MJ Hwang, R Yamamoto, E Song… - ICASSP 2021-2021 …, 2021 - ieeexplore.ieee.org
In this paper, we propose a text-to-speech (TTS)-driven data augmentation method for
improving the quality of a non-autoregressive (AR) TTS system. Recently proposed non-AR …

Language model-based emotion prediction methods for emotional speech synthesis systems

HW Yoon, O Kwon, H Lee, R Yamamoto… - arXiv preprint arXiv …, 2022 - arxiv.org
This paper proposes an effective emotional text-to-speech (TTS) system with a pre-trained
language model (LM)-based emotion prediction method. Unlike conventional systems that …

Modeling and driving human body soundfields through acoustic primitives

C Huang, D Marković, C Xu, A Richard - European Conference on …, 2024 - Springer
While rendering and animation of photorealistic 3D human body models have matured and
reached an impressive quality over the past years, modeling the spatial audio associated …

N-singer: A non-autoregressive Korean singing voice synthesis system for pronunciation enhancement

GH Lee, TW Kim, H Bae, MJ Lee, YI Kim… - arXiv preprint arXiv …, 2021 - arxiv.org
Recently, end-to-end Korean singing voice systems have been designed to generate
realistic singing voices. However, these systems still suffer from a lack of robustness in terms …

High-fidelity and pitch-controllable neural vocoder based on unified source-filter networks

R Yoneyama, YC Wu, T Toda - IEEE/ACM Transactions on …, 2023 - ieeexplore.ieee.org
We introduce unified source-filter generative adversarial networks (uSFGAN), a waveform
generative model conditioned on acoustic features, which represents the source-filter …

Improved Parallel WaveGAN vocoder with perceptually weighted spectrogram loss

E Song, R Yamamoto, MJ Hwang… - 2021 IEEE Spoken …, 2021 - ieeexplore.ieee.org
This paper proposes a spectral-domain perceptual weighting technique for Parallel
WaveGAN-based text-to-speech (TTS) systems. The recently proposed Parallel WaveGAN …

Sounding bodies: modeling 3D spatial sound of humans using body pose and audio

X Xu, D Markovic, J Sandakly… - Advances in …, 2024 - proceedings.neurips.cc
While 3D human body modeling has received much attention in computer vision, modeling
the acoustic equivalent, i.e., modeling 3D spatial audio produced by body motion and speech …

High-Fidelity Parallel WaveGAN with Multi-Band Harmonic-Plus-Noise Model.

MJ Hwang, R Yamamoto, E Song, JM Kim - Interspeech, 2021 - sewplay.github.io
This paper proposes a multi-band harmonic-plus-noise (HN) Parallel WaveGAN (PWG)
vocoder. To generate a high-fidelity speech signal, it is important to well-reflect the harmonic …

Phrase break prediction with bidirectional encoder representations in Japanese text-to-speech synthesis

K Futamata, B Park, R Yamamoto… - arXiv preprint arXiv …, 2021 - arxiv.org
We propose a novel phrase break prediction method that combines implicit features
extracted from a pre-trained large language model, a.k.a. BERT, and explicit features …