ESPnet-Codec: Comprehensive training and evaluation of neural codecs for audio, music, and speech

J Shi, J Tian, Y Wu, J Jung, JQ Yip… - 2024 IEEE Spoken …, 2024 - ieeexplore.ieee.org
Neural codecs have become crucial to recent speech and audio generation research. In
addition to signal compression capabilities, discrete codecs have also been found to …

Preference tuning with human feedback on language, speech, and vision tasks: A survey

GI Winata, H Zhao, A Das, W Tang, DD Yao… - arXiv preprint arXiv …, 2024 - arxiv.org
Preference tuning is a crucial process for aligning deep generative models with human
preferences. This survey offers a thorough overview of recent advancements in preference …

URGENT challenge: Universality, robustness, and generalizability for speech enhancement

W Zhang, R Scheibler, K Saijo, S Cornell, C Li… - arXiv preprint arXiv …, 2024 - arxiv.org
The last decade has witnessed significant advancements in deep learning-based speech
enhancement (SE). However, most existing SE research has limitations on the coverage of …

MOS-Bench: Benchmarking generalization abilities of subjective speech quality assessment models

WC Huang, E Cooper, T Toda - arXiv preprint arXiv:2411.03715, 2024 - arxiv.org
Subjective speech quality assessment (SSQA) is critical for evaluating speech samples as
perceived by human listeners. While model-based SSQA has enjoyed great success thanks …

VERSA: A Versatile Evaluation Toolkit for Speech, Audio, and Music

J Shi, H Shim, J Tian, S Arora, H Wu… - arXiv preprint arXiv …, 2024 - arxiv.org
In this work, we introduce VERSA, a unified and standardized evaluation toolkit designed for
various speech, audio, and music signals. The toolkit features a Pythonic interface with …

Enabling Auditory Large Language Models for Automatic Speech Quality Evaluation

S Wang, W Yu, Y Yang, C Tang, Y Li, J Zhuang… - arXiv preprint arXiv …, 2024 - arxiv.org
Speech quality assessment typically requires evaluating audio from multiple aspects, such
as mean opinion score (MOS) and speaker similarity (SIM), which can be challenging to …

Massively Multilingual Forced Aligner Leveraging Self-Supervised Discrete Units

H Inaguma, I Kulikov, Z Ni, S Popuri… - 2024 IEEE Spoken …, 2024 - ieeexplore.ieee.org
We propose a massively multilingual speech-to-text neural forced aligner that supports 98
languages with a single architecture. The aligner takes self-supervised discrete acoustic …

Deep Speech Synthesis from Multimodal Articulatory Representations

P Wu, B Yu, K Scheck, AW Black… - arXiv preprint arXiv …, 2024 - arxiv.org
The amount of articulatory data available for training deep learning models is far smaller
than the amount of acoustic speech data. In order to improve articulatory-to-acoustic synthesis …

AnyEnhance: A Unified Generative Model with Prompt-Guidance and Self-Critic for Voice Enhancement

J Zhang, J Yang, Z Fang, Y Wang, Z Zhang… - arXiv preprint arXiv …, 2025 - arxiv.org
We introduce AnyEnhance, a unified generative model for voice enhancement that
processes both speech and singing voices. Based on a masked generative model …

BnTTS: Few-Shot Speaker Adaptation in Low-Resource Setting

MJI Basher, M Kowsher, MS Islam, RN Nandi… - arXiv preprint arXiv …, 2025 - arxiv.org
This paper introduces BnTTS (Bangla Text-To-Speech), the first framework for Bangla
speaker adaptation-based TTS, designed to bridge the gap in Bangla speech synthesis …