Overview of speaker modeling and its applications: From the lens of deep speaker representation learning

S Wang, Z Chen, KA Lee, Y Qian… - IEEE/ACM Transactions …, 2024 - ieeexplore.ieee.org
Speaker individuality information is among the most critical elements within speech signals.
By thoroughly and accurately modeling this information, it can be utilized in various …

ESPnet-Codec: Comprehensive training and evaluation of neural codecs for audio, music, and speech

J Shi, J Tian, Y Wu, J Jung, JQ Yip… - 2024 IEEE Spoken …, 2024 - ieeexplore.ieee.org
Neural codecs have become crucial to recent speech and audio generation research. In
addition to signal compression capabilities, discrete codecs have also been found to …

VISinger2+: End-to-End Singing Voice Synthesis Augmented by Self-Supervised Learning Representation

Y Yu, J Shi, Y Wu, Y Tang… - 2024 IEEE Spoken …, 2024 - ieeexplore.ieee.org
Singing Voice Synthesis (SVS) has witnessed significant advancements with the advent of
deep learning techniques. However, a significant challenge in SVS is the scarcity of labeled …

Preference Alignment Improves Language Model-Based TTS

J Tian, C Zhang, J Shi, H Zhang, J Yu… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent advancements in text-to-speech (TTS) have shown that language model (LM)-based
systems offer competitive performance to their counterparts. Further optimization can be …

VERSA: A Versatile Evaluation Toolkit for Speech, Audio, and Music

J Shi, H Shim, J Tian, S Arora, H Wu… - arXiv preprint arXiv …, 2024 - arxiv.org
In this work, we introduce VERSA, a unified and standardized evaluation toolkit designed for
various speech, audio, and music signals. The toolkit features a Pythonic interface with …

Recent Advances in Discrete Speech Tokens: A Review

Y Guo, Z Li, H Wang, B Li, C Shao, H Zhang… - arXiv preprint arXiv …, 2025 - arxiv.org
The rapid advancement of speech generation technologies in the era of large language
models (LLMs) has established discrete speech tokens as a foundational paradigm for …

vec2wav 2.0: Advancing Voice Conversion via Discrete Token Vocoders

Y Guo, Z Li, J Li, C Du, H Wang, S Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
We propose a new speech discrete token vocoder, vec2wav 2.0, which advances voice
conversion (VC). We use discrete tokens from speech self-supervised models as the content …

Recursive Attentive Pooling For Extracting Speaker Embeddings From Multi-Speaker Recordings

S Horiguchi, A Ando, T Moriya… - 2024 IEEE Spoken …, 2024 - ieeexplore.ieee.org
This paper proposes a method for extracting speaker embedding for each speaker from a
variable-length recording containing multiple speakers. Speaker embeddings are crucial not …

LSCodec: Low-Bitrate and Speaker-Decoupled Discrete Speech Codec

Y Guo, Z Li, C Du, H Wang, X Chen, K Yu - arXiv preprint arXiv …, 2024 - arxiv.org
Although discrete speech tokens have exhibited strong potential for language model-based
speech generation, their high bitrates and redundant timbre information restrict the …

ESPnet-EZ: Python-Only ESPnet For Easy Fine-Tuning And Integration

M Someki, K Choi, S Arora, W Chen… - 2024 IEEE Spoken …, 2024 - ieeexplore.ieee.org
We introduce ESPnet-EZ, an extension of the open-source speech processing toolkit
ESPnet, aimed at quick and easy development of speech models. ESPnet-EZ focuses on …