Towards universal speech discrete tokens: A case study for ASR and TTS

Y Yang, F Shen, C Du, Z Ma, K Yu… - ICASSP 2024-2024 …, 2024 - ieeexplore.ieee.org
Self-supervised learning (SSL) proficiency in speech-related tasks has driven research into
utilizing discrete tokens for speech tasks like recognition and translation, which offer lower …

CTC variations through new WFST topologies

A Laptev, S Majumdar, B Ginsburg - arXiv preprint arXiv …

Blank-regularized CTC for frame skipping in neural transducer

Y Yang, X Yang, L Guo, Z Yao, W Kang… - arXiv preprint arXiv …, 2023 - arxiv.org
Neural Transducer and connectionist temporal classification (CTC) are popular end-to-end
automatic speech recognition systems. Due to their frame-synchronous design, blank …

Unsupervised Domain Adaptation on End-to-End Multi-talker Overlapped Speech Recognition

L Zheng, H Zhu, S Tian, Q Zhao… - IEEE Signal Processing …, 2024 - ieeexplore.ieee.org
Serialized Output Training (SOT) has emerged as the mainstream approach for addressing
the multi-talker overlapped speech recognition challenge due to its simplicity. However, SOT …

Exploring SSL Discrete Speech Features for Zipformer-based Contextual ASR

M Cui, Y Yang, J Deng, J Kang, S Hu, T Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Self-supervised learning (SSL) based discrete speech representations are highly compact
and domain adaptable. In this paper, SSL discrete speech features extracted from WavLM …

Efficient Cascaded Streaming ASR System via Frame Rate Reduction

X Cai, D Qiu, S Ding, D Hwang… - 2023 IEEE Automatic …, 2023 - ieeexplore.ieee.org
In this paper, we explore various frame rate reduction schemes on the two-pass cascaded
encoder model to improve its efficiency without sacrificing the transcription quality. We …

The X-LANCE Technical Report for Interspeech 2024 Speech Processing Using Discrete Speech Unit Challenge

Y Guo, C Wang, Y Yang, H Wang, Z Ma… - 2024 IEEE 14th …, 2024 - ieeexplore.ieee.org
Discrete speech tokens have become increasingly popular in multiple speech processing
fields, including automatic speech recognition (ASR), text-to-speech (TTS) and singing voice …

TTS-Transducer: End-to-End Speech Synthesis with Neural Transducer

V Bataev, S Ghosh, V Lavrukhin, J Li - arXiv preprint arXiv:2501.06320, 2025 - arxiv.org
This work introduces TTS-Transducer, a novel architecture for text-to-speech, leveraging the
strengths of audio codec models and neural transducers. Transducers, renowned for their …

The Vicomtech Speech Transcription Systems for the Albayzin 2024 Bilingual Basque-Spanish Speech to Text (BBS-S2T) Challenge

JC Vásquez-Correa, A Alvarez, H Arzelus… - Proceedings of …, 2024 - isca-archive.org
This paper presents Vicomtech's submission to the Albayzín 2024 Bilingual Basque-
Spanish Speech-to-Text Challenge, which involves evaluating automatic speech …

Powerful and Extensible WFST Framework for RNN-Transducer Losses

A Laptev, V Bataev, I Gitman… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org
This paper presents a framework based on Weighted Finite-State Transducers (WFST) to
simplify the development of modifications for RNN-Transducer (RNN-T) loss. Existing …