Overview of speaker modeling and its applications: From the lens of deep speaker representation learning
Speaker individuality information is among the most critical elements within speech signals.
By thoroughly and accurately modeling this information, it can be utilized in various …
ESPnet-Codec: Comprehensive training and evaluation of neural codecs for audio, music, and speech
Neural codecs have become crucial to recent speech and audio generation research. In
addition to signal compression capabilities, discrete codecs have also been found to …
VISinger2+: End-to-End Singing Voice Synthesis Augmented by Self-Supervised Learning Representation
Singing Voice Synthesis (SVS) has witnessed significant advancements with the advent of
deep learning techniques. However, a significant challenge in SVS is the scarcity of labeled …
Preference Alignment Improves Language Model-Based TTS
Recent advancements in text-to-speech (TTS) have shown that language model (LM)-based
systems offer competitive performance to their counterparts. Further optimization can be …
VERSA: A Versatile Evaluation Toolkit for Speech, Audio, and Music
In this work, we introduce VERSA, a unified and standardized evaluation toolkit designed for
various speech, audio, and music signals. The toolkit features a Pythonic interface with …
Recent Advances in Discrete Speech Tokens: A Review
Y Guo, Z Li, H Wang, B Li, C Shao, H Zhang… - arXiv preprint arXiv …, 2025 - arxiv.org
The rapid advancement of speech generation technologies in the era of large language
models (LLMs) has established discrete speech tokens as a foundational paradigm for …
vec2wav 2.0: Advancing Voice Conversion via Discrete Token Vocoders
We propose a new speech discrete token vocoder, vec2wav 2.0, which advances voice
conversion (VC). We use discrete tokens from speech self-supervised models as the content …
Recursive Attentive Pooling For Extracting Speaker Embeddings From Multi-Speaker Recordings
This paper proposes a method for extracting speaker embedding for each speaker from a
variable-length recording containing multiple speakers. Speaker embeddings are crucial not …
LSCodec: Low-Bitrate and Speaker-Decoupled Discrete Speech Codec
Although discrete speech tokens have exhibited strong potential for language model-based
speech generation, their high bitrates and redundant timbre information restrict the …
ESPnet-EZ: Python-Only ESPnet For Easy Fine-Tuning And Integration
We introduce ESPnet-EZ, an extension of the open-source speech processing toolkit
ESPnet, aimed at quick and easy development of speech models. ESPnet-EZ focuses on …