ESPnet-Codec: Comprehensive training and evaluation of neural codecs for audio, music, and speech
Neural codecs have become crucial to recent speech and audio generation research. In
addition to signal compression capabilities, discrete codecs have also been found to …
AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation
This paper proposes a novel direct Audio-Visual Speech to Audio-Visual Speech
Translation (AV2AV) framework where the input and output of the system are multimodal (ie …
SpeechPrompt: Prompting speech language models for speech processing tasks
Prompting has become a practical method for utilizing pre-trained language models (LMs).
This approach offers several advantages. It allows an LM to adapt to new tasks with minimal …
Muskits-ESPnet: A comprehensive toolkit for singing voice synthesis in new paradigm
This research presents Muskits-ESPnet, a versatile toolkit that introduces new paradigms to
Singing Voice Synthesis (SVS) through the application of pretrained audio models in both …
Textless Unit-to-Unit Training for Many-to-Many Multilingual Speech-to-Speech Translation
This paper proposes a textless training method for many-to-many multilingual speech-to-
speech translation that can also benefit the transfer of pre-trained knowledge to text-based …
LAST: Language model aware speech tokenization
Speech tokenization serves as the foundation of speech language model (LM), enabling
them to perform various tasks such as spoken language modeling, text-to-speech, speech-to …
MMM: Multi-Layer Multi-Residual Multi-Stream Discrete Speech Representation from Self-supervised Learning Model
Speech discrete representation has proven effective in various downstream applications
due to its superior compression rate of the waveform, fast convergence during training, and …
The Interspeech 2024 Challenge on Speech Processing Using Discrete Units
Representing speech and audio signals in discrete units has become a compelling
alternative to traditional high-dimensional feature vectors. Numerous studies have …
A Comparative Study of Discrete Speech Tokens for Semantic-Related Tasks with Large Language Models
With the rise of Speech Large Language Models (Speech LLMs), there has been growing
interest in discrete speech tokens for their ability to integrate with text-based tokens …
SingOMD: Singing Oriented Multi-resolution Discrete Representation Construction from Speech Models
Discrete representation has shown advantages in speech generation tasks, wherein
discrete tokens are derived by discretizing hidden features from self-supervised learning …