Speechx: Neural codec language model as a versatile speech transformer

X Wang, M Thakker, Z Chen, N Kanda… - … on Audio, Speech …, 2024 - ieeexplore.ieee.org
Recent advancements in generative speech models based on audio-text prompts have
enabled remarkable innovations like high-quality zero-shot text-to-speech. However …

Overview of speaker modeling and its applications: From the lens of deep speaker representation learning

S Wang, Z Chen, KA Lee, Y Qian… - IEEE/ACM Transactions …, 2024 - ieeexplore.ieee.org
Speaker individuality information is among the most critical elements within speech signals.
By thoroughly and accurately modeling this information, it can be utilized in various …

Wavtokenizer: an efficient acoustic discrete codec tokenizer for audio language modeling

S Ji, Z Jiang, W Wang, Y Chen, M Fang, J Zuo… - arxiv preprint arxiv …, 2024 - arxiv.org
Language models have been effectively applied to modeling natural signals, such as
images, video, speech, and audio. A crucial component of these models is the codec …

Codec-superb@ slt 2024: A lightweight benchmark for neural audio codec models

H Wu, X Chen, YC Lin, K Chang, J Du… - 2024 IEEE Spoken …, 2024 - ieeexplore.ieee.org
Neural audio codec models are becoming increasingly important as they serve as
tokenizers for audio, enabling efficient transmission or facilitating speech language …

The VoicePrivacy 2024 Challenge Evaluation Plan

N Tomashenko, X Miao, P Champion, S Meyer… - arxiv preprint arxiv …, 2024 - arxiv.org
The task of the challenge is to develop a voice anonymization system for speech data which
conceals the speaker's voice identity while protecting linguistic content and emotional states …

Autoregressive speech synthesis without vector quantization

L Meng, L Zhou, S Liu, S Chen, B Han, S Hu… - arxiv preprint arxiv …, 2024 - arxiv.org
We present MELLE, a novel continuous-valued tokens based language modeling approach
for text to speech synthesis (TTS). MELLE autoregressively generates continuous mel …

Amphion: an Open-Source Audio, Music, and Speech Generation Toolkit

X Zhang, L Xue, Y Gu, Y Wang, J Li… - 2024 IEEE Spoken …, 2024 - ieeexplore.ieee.org
Amphion is an open-source toolkit for Audio, Music, and Speech Generation, targeting to
ease the way for junior researchers and engineers into these fields. It presents a unified …

E2 tts: Embarrassingly easy fully non-autoregressive zero-shot tts

SE Eskimez, X Wang, M Thakker, C Li… - 2024 IEEE Spoken …, 2024 - ieeexplore.ieee.org
This paper introduces Embarrassingly Easy Text-to-Speech (E2 TTS), a fully non-
autoregressive zero-shot text-to-speech system that offers human-level naturalness and …

F5-tts: A fairytaler that fakes fluent and faithful speech with flow matching

Y Chen, Z Niu, Z Ma, K Deng, C Wang, J Zhao… - arxiv preprint arxiv …, 2024 - arxiv.org
This paper introduces F5-TTS, a fully non-autoregressive text-to-speech system based on
flow matching with Diffusion Transformer (DiT). Without requiring complex designs such as …

Emilia: An extensive, multilingual, and diverse speech dataset for large-scale speech generation

H He, Z Shang, C Wang, X Li, Y Gu… - 2024 IEEE Spoken …, 2024 - ieeexplore.ieee.org
Recent advancements in speech generation models have been significantly driven by the
use of large-scale training data. However, producing highly spontaneous, human-like …