WavLM: Large-scale self-supervised pre-training for full stack speech processing

S Chen, C Wang, Z Chen, Y Wu, S Liu… - IEEE Journal of …, 2022 - ieeexplore.ieee.org
Self-supervised learning (SSL) achieves great success in speech recognition, while limited
exploration has been attempted for other speech processing tasks. As speech signal …

Streaming multi-talker ASR with token-level serialized output training

N Kanda, J Wu, Y Wu, X Xiao, Z Meng, X Wang… - arXiv preprint arXiv …, 2022 - arxiv.org
This paper proposes a token-level serialized output training (t-SOT), a novel framework for
streaming multi-talker automatic speech recognition (ASR). Unlike existing streaming multi …

A deep hierarchical fusion network for fullband acoustic echo cancellation

H Zhao, N Li, R Han, L Chen, X Zheng… - ICASSP 2022-2022 …, 2022 - ieeexplore.ieee.org
Deep learning-based wideband (16 kHz) acoustic echo cancellation (AEC) approaches have
surpassed traditional methods. This work proposes a deep hierarchical fusion (DHF) …

Speech separation with large-scale self-supervised learning

Z Chen, N Kanda, J Wu, Y Wu, X Wang… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org
Self-supervised learning (SSL) methods such as WavLM have shown promising speech
separation (SS) results in small-scale simulation-based experiments. In this work, we extend …

Serialized output training by learned dominance

Y Shi, L Li, S Yin, D Wang, J Han - arXiv preprint arXiv:2407.03966, 2024 - arxiv.org
Serialized Output Training (SOT) has showcased state-of-the-art performance in multi-talker
speech recognition by sequentially decoding the speech of individual speakers. To address …

On Speaker Attribution with SURT

D Raj, M Wiesner, M Maciejewski… - arXiv preprint arXiv …, 2024 - arxiv.org
The Streaming Unmixing and Recognition Transducer (SURT) has recently become a
popular framework for continuous, streaming, multi-talker speech recognition (ASR). With …

PolyScriber: Integrated fine-tuning of extractor and lyrics transcriber for polyphonic music

X Gao, C Gupta, H Li - IEEE/ACM Transactions on Audio …, 2023 - ieeexplore.ieee.org
Lyrics transcription of polyphonic music is challenging as the background music affects lyrics
intelligibility. Typically, lyrics transcription can be performed by a two-step pipeline, i.e., a …

Multi-stage and multi-loss training for fullband non-personalized and personalized speech enhancement

L Chen, C Xu, X Zhang, X Ren, X Zheng… - ICASSP 2022-2022 …, 2022 - ieeexplore.ieee.org
Deep learning-based wideband (16 kHz) speech enhancement approaches have surpassed
traditional methods. This work further extends the existing wideband systems to enable full …

Keyword Guided Target Speech Recognition

Y Shi, L Li, D Wang, J Han - IEEE Signal Processing Letters, 2024 - ieeexplore.ieee.org
This letter presents a new target speech recognition problem, where the target speech is
defined by a keyword. For instance, when a person speaks “Hey Google” or “Help Me”, we …

Leveraging real conversational data for multi-channel continuous speech separation

X Wang, D Wang, N Kanda, SE Eskimez… - arXiv preprint arXiv …, 2022 - arxiv.org
Existing multi-channel continuous speech separation (CSS) models are heavily dependent
on supervised data: either simulated data, which causes data mismatch between the training …