FreeVC: Towards high-quality text-free one-shot voice conversion

J Li, W Tu, L Xiao - ICASSP 2023-2023 IEEE International …, 2023 - ieeexplore.ieee.org
Voice conversion (VC) can be achieved by first extracting source content information and
target speaker information, and then reconstructing the waveform with this information …

SUPERB-SG: Enhanced speech processing universal performance benchmark for semantic and generative capabilities

HS Tsai, HJ Chang, WC Huang, Z Huang… - arXiv preprint arXiv …, 2022 - arxiv.org
Transfer learning has proven to be crucial in advancing the state of speech and natural
language processing research in recent years. In speech, a model pre-trained by self …

A large-scale evaluation of speech foundation models

S Yang, HJ Chang, Z Huang, AT Liu… - … on Audio, Speech …, 2024 - ieeexplore.ieee.org
The foundation model paradigm leverages a shared foundation model to achieve state-of-
the-art (SOTA) performance for various tasks, requiring minimal downstream-specific data …

From discrete tokens to high-fidelity audio using multi-band diffusion

R San Roman, Y Adi, A Deleforge… - Advances in …, 2023 - proceedings.neurips.cc
Deep generative models can generate high-fidelity audio conditioned on various types of
representations (e.g., mel-spectrograms, Mel-frequency Cepstral Coefficients (MFCC)) …

PARP: Prune, adjust and re-prune for self-supervised speech recognition

CIJ Lai, Y Zhang, AH Liu, S Chang… - Advances in …, 2021 - proceedings.neurips.cc
Self-supervised speech representation learning (speech SSL) has demonstrated the benefit
of scale in learning rich representations for Automatic Speech Recognition (ASR) with …

DDDM-VC: Decoupled denoising diffusion models with disentangled representation and prior mixup for verified robust voice conversion

HY Choi, SH Lee, SW Lee - Proceedings of the AAAI Conference on …, 2024 - ojs.aaai.org
Diffusion-based generative models have recently exhibited powerful generative
performance. However, as many attributes exist in the data distribution and owing to several …

ZMM-TTS: Zero-shot multilingual and multispeaker speech synthesis conditioned on self-supervised discrete speech representations

C Gong, X Wang, E Cooper, D Wells… - … on Audio, Speech …, 2024 - ieeexplore.ieee.org
Neural text-to-speech (TTS) has achieved human-like synthetic speech for single-speaker,
single-language synthesis. Multilingual TTS systems are limited to resource-rich languages …

Self-supervised asr models and features for dysarthric and elderly speech recognition

S Hu, X **e, M Geng, Z **, J Deng, G Li… - … on Audio, Speech …, 2024 - ieeexplore.ieee.org
Self-supervised learning (SSL) based speech foundation models have been applied to a
wide range of ASR tasks. However, their application to dysarthric and elderly speech via …

Efficient domain adaptation for speech foundation models

B Li, D Hwang, Z Huo, J Bai, G Prakash… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org
Foundation models (FMs), which are trained on broad data at scale and are adaptable to a
wide range of downstream tasks, have attracted broad interest in the research community …

ACE-VC: Adaptive and controllable voice conversion using explicitly disentangled self-supervised speech representations

S Hussain, P Neekhara, J Huang, J Li… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org
In this work, we propose a zero-shot voice conversion method using speech representations
trained with self-supervised learning. First, we develop a multi-task model to decompose a …