Neural codec language models are zero-shot text to speech synthesizers

C Wang, S Chen, Y Wu, Z Zhang, L Zhou, S Liu… - arXiv preprint arXiv …, 2023 - arxiv.org
We introduce a language modeling approach for text to speech synthesis (TTS). Specifically,
we train a neural codec language model (called VALL-E) using discrete codes derived from …

VALL-E 2: Neural codec language models are human parity zero-shot text to speech synthesizers

S Chen, S Liu, L Zhou, Y Liu, X Tan, J Li, S Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
This paper introduces VALL-E 2, the latest advancement in neural codec language models
that marks a milestone in zero-shot text-to-speech synthesis (TTS), achieving human parity …

OneFi: One-shot recognition for unseen gesture via COTS WiFi

R Xiao, J Liu, J Han, K Ren - Proceedings of the 19th ACM Conference …, 2021 - dl.acm.org
WiFi-based Human Gesture Recognition (HGR) becomes increasingly promising for device-
free human-computer interaction. However, existing WiFi-based approaches have not been …

Meta-TTS: Meta-learning for few-shot speaker adaptive text-to-speech

SF Huang, CJ Lin, DR Liu, YC Chen… - IEEE/ACM Transactions …, 2022 - ieeexplore.ieee.org
Personalizing a speech synthesis system is a highly desired application, where the system
can generate speech in the user's voice from only a few enrolled recordings. There are two main …

USAT: A universal speaker-adaptive text-to-speech approach

W Wang, Y Song, S Jha - IEEE/ACM Transactions on Audio …, 2024 - ieeexplore.ieee.org
Conventional text-to-speech (TTS) research has predominantly focused on enhancing the
quality of synthesized speech for speakers in the training dataset. The challenge of …

The multi-speaker multi-style voice cloning challenge 2021

Q Xie, X Tian, G Liu, K Song, L Xie, Z Wu… - ICASSP 2021-2021 …, 2021 - ieeexplore.ieee.org
The Multi-speaker Multi-style Voice Cloning Challenge (M2VoC) aims to provide a common
sizable dataset as well as a fair testbed for the benchmarking of the popular voice cloning …

Voice filter: Few-shot text-to-speech speaker adaptation using voice conversion as a post-processing module

A Gabryś, G Huybrechts, MS Ribeiro… - ICASSP 2022-2022 …, 2022 - ieeexplore.ieee.org
State-of-the-art text-to-speech (TTS) systems require several hours of recorded speech data
to generate high-quality synthetic speech. When using reduced amounts of training data …

Neural codec language models are zero-shot text to speech synthesizers

S Chen, C Wang, Y Wu, Z Zhang, L Zhou… - … on Audio, Speech …, 2025 - ieeexplore.ieee.org
We introduce a language modeling approach for text to speech synthesis (TTS). Specifically,
we train a neural codec language model (called VALL-E) using discrete codes derived from …

Adversarially learning disentangled speech representations for robust multi-factor voice conversion

J Wang, J Li, X Zhao, Z Wu, S Kang, H Meng - arXiv preprint arXiv …, 2021 - arxiv.org
Factorizing speech as disentangled speech representations is vital to achieve highly
controllable style transfer in voice conversion (VC). Conventional speech representation …

Takin-VC: Zero-shot voice conversion via jointly hybrid content and memory-augmented context-aware timbre modeling

Y Yang, Y Pan, J Yao, X Zhang, J Ye, H Zhou… - arXiv preprint arXiv …, 2024 - arxiv.org
Zero-shot voice conversion (VC) aims to transform the source speaker timbre into an
arbitrary unseen one without altering the original speech content. While recent …