emotion2vec: Self-supervised pre-training for speech emotion representation

Z Ma, Z Zheng, J Ye, J Li, Z Gao, S Zhang… - arXiv preprint arXiv …, 2023 - arxiv.org
We propose emotion2vec, a universal speech emotion representation model. emotion2vec
is pre-trained on open-source unlabeled emotion data through self-supervised online …

Foundation models for music: A survey

Y Ma, A Øland, A Ragni, BMS Del Sette, C Saitis… - arXiv preprint arXiv …, 2024 - arxiv.org
In recent years, foundation models (FMs) such as large language models (LLMs) and latent
diffusion models (LDMs) have profoundly impacted diverse sectors, including music. This …

SpeechUT: Bridging speech and text with hidden-unit for encoder-decoder based speech-text pre-training

Z Zhang, L Zhou, J Ao, S Liu, L Dai, J Li… - arXiv preprint arXiv …, 2022 - arxiv.org
The rapid development of single-modal pre-training has prompted researchers to pay more
attention to cross-modal pre-training methods. In this paper, we propose a unified-modal …

SpeechLM: Enhanced speech pre-training with unpaired textual data

Z Zhang, S Chen, L Zhou, Y Wu, S Ren… - … on Audio, Speech …, 2024 - ieeexplore.ieee.org
How to boost speech pre-training with textual data remains an unsolved problem because
speech and text are very different modalities with distinct characteristics. In this paper …

Reducing barriers to self-supervised learning: HuBERT pre-training with academic compute

W Chen, X Chang, Y Peng, Z Ni, S Maiti… - arXiv preprint arXiv …, 2023 - arxiv.org
Self-supervised learning (SSL) has led to great strides in speech processing. However, the
resources needed to train these models have become prohibitively large as they continue to …

MT4SSL: Boosting self-supervised speech representation learning by integrating multiple targets

Z Ma, Z Zheng, C Tang, Y Wang, X Chen - arXiv preprint arXiv:2211.07321, 2022 - arxiv.org
In this paper, we provide a new perspective on self-supervised speech models, based on how
the self-training targets are obtained. We generalize the targets extractor into Offline Targets …

Pushing the limits of unsupervised unit discovery for SSL speech representation

Z Ma, Z Zheng, G Yang, Y Wang, C Zhang… - arXiv preprint arXiv …, 2023 - arxiv.org
The excellent generalization ability of self-supervised learning (SSL) for speech foundation
models has garnered significant attention. HuBERT is a successful example that utilizes …

Fast-HuBERT: An efficient training framework for self-supervised speech representation learning

G Yang, Z Ma, Z Zheng, Y Song, Z Niu… - 2023 IEEE Automatic …, 2023 - ieeexplore.ieee.org
Recent years have witnessed significant advancements in self-supervised learning (SSL)
methods for speech-processing tasks. Various speech-based SSL models have been …

CTCBERT: Advancing hidden-unit BERT with CTC objectives

R Fan, Y Wang, Y Gaur, J Li - ICASSP 2023-2023 IEEE …, 2023 - ieeexplore.ieee.org
In this work, we present a simple but effective method, CTCBERT, for advancing hidden-unit
BERT (HuBERT). HuBERT applies a frame-level cross-entropy (CE) loss, which is similar to …
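
The snippet above contrasts HuBERT's frame-level cross-entropy on discovered hidden units with a CTC-style objective. The sketch below is only a minimal illustration of that contrast under assumed tensor shapes and hypothetical unit sequences; it is not the actual CTCBERT training recipe, which the snippet does not specify.

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes: batch of 2 utterances, 50 frames, 100 hidden-unit classes.
B, T, C = 2, 50, 100
logits = torch.randn(B, T, C)  # per-frame model outputs

# HuBERT-style frame-level CE: every frame carries one pseudo-label (e.g. a k-means unit).
frame_targets = torch.randint(0, C, (B, T))
ce_loss = F.cross_entropy(logits.reshape(B * T, C), frame_targets.reshape(B * T))

# CTC-style objective: supervise with the deduplicated unit sequence instead,
# letting the frame alignment be marginalized. Class 0 serves as the CTC blank here.
log_probs = F.log_softmax(logits, dim=-1).transpose(0, 1)  # (T, B, C) layout expected by CTC
unit_seqs = [torch.tensor([3, 17, 17, 42, 5]), torch.tensor([8, 8, 2, 9])]
targets = torch.cat([torch.unique_consecutive(s) for s in unit_seqs])
target_lengths = torch.tensor([len(torch.unique_consecutive(s)) for s in unit_seqs])
input_lengths = torch.full((B,), T)
ctc_loss = F.ctc_loss(log_probs, targets, input_lengths, target_lengths, blank=0)

print(f"frame-level CE: {ce_loss.item():.3f}  CTC: {ctc_loss.item():.3f}")
```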

Token2vec: A joint self-supervised pre-training framework using unpaired speech and text

X Yue, J Ao, X Gao, H Li - ICASSP 2023-2023 IEEE …, 2023 - ieeexplore.ieee.org
Self-supervised pre-training has been successful in both text and speech processing.
Speech and text offer different but complementary information. The question is whether we …