SpeechX: Neural codec language model as a versatile speech transformer
Recent advancements in generative speech models based on audio-text prompts have
enabled remarkable innovations like high-quality zero-shot text-to-speech. However …
Overview of speaker modeling and its applications: From the lens of deep speaker representation learning
Speaker individuality information is among the most critical elements within speech signals.
By thoroughly and accurately modeling this information, it can be utilized in various …
WavTokenizer: an efficient acoustic discrete codec tokenizer for audio language modeling
Language models have been effectively applied to modeling natural signals, such as
images, video, speech, and audio. A crucial component of these models is the codec …
Codec-SUPERB @ SLT 2024: A lightweight benchmark for neural audio codec models
Neural audio codec models are becoming increasingly important as they serve as
tokenizers for audio, enabling efficient transmission or facilitating speech language …
The VoicePrivacy 2024 Challenge Evaluation Plan
The task of the challenge is to develop a voice anonymization system for speech data which
conceals the speaker's voice identity while protecting linguistic content and emotional states …
Autoregressive speech synthesis without vector quantization
We present MELLE, a novel continuous-valued tokens based language modeling approach
for text to speech synthesis (TTS). MELLE autoregressively generates continuous mel …
Amphion: an Open-Source Audio, Music, and Speech Generation Toolkit
Amphion is an open-source toolkit for Audio, Music, and Speech Generation, targeting to
ease the way for junior researchers and engineers into these fields. It presents a unified …
E2 TTS: Embarrassingly easy fully non-autoregressive zero-shot TTS
This paper introduces Embarrassingly Easy Text-to-Speech (E2 TTS), a fully non-
autoregressive zero-shot text-to-speech system that offers human-level naturalness and …
F5-TTS: A fairytaler that fakes fluent and faithful speech with flow matching
This paper introduces F5-TTS, a fully non-autoregressive text-to-speech system based on
flow matching with Diffusion Transformer (DiT). Without requiring complex designs such as …
Emilia: An extensive, multilingual, and diverse speech dataset for large-scale speech generation
Recent advancements in speech generation models have been significantly driven by the
use of large-scale training data. However, producing highly spontaneous, human-like …