WavLM: Large-scale self-supervised pre-training for full stack speech processing
Self-supervised learning (SSL) achieves great success in speech recognition, while limited
exploration has been attempted for other speech processing tasks. As speech signal …
Streaming multi-talker ASR with token-level serialized output training
This paper proposes a token-level serialized output training (t-SOT), a novel framework for
streaming multi-talker automatic speech recognition (ASR). Unlike existing streaming multi …
A deep hierarchical fusion network for fullband acoustic echo cancellation
Deep learning-based wideband (16 kHz) acoustic echo cancellation (AEC) approaches have
surpassed traditional methods. This work proposes a deep hierarchical fusion (DHF) …
Speech separation with large-scale self-supervised learning
Self-supervised learning (SSL) methods such as WavLM have shown promising speech
separation (SS) results in small-scale simulation-based experiments. In this work, we extend …
Serialized output training by learned dominance
Serialized Output Training (SOT) has showcased state-of-the-art performance in multi-talker
speech recognition by sequentially decoding the speech of individual speakers. To address …
On Speaker Attribution with SURT
The Streaming Unmixing and Recognition Transducer (SURT) has recently become a
popular framework for continuous, streaming, multi-talker speech recognition (ASR). With …
Polyscriber: Integrated fine-tuning of extractor and lyrics transcriber for polyphonic music
Lyrics transcription of polyphonic music is challenging as the background music affects lyrics
intelligibility. Typically, lyrics transcription can be performed by a two-step pipeline, i.e., a …
Multi-stage and multi-loss training for fullband non-personalized and personalized speech enhancement
Deep learning-based wideband (16 kHz) speech enhancement approaches have surpassed
traditional methods. This work further extends the existing wideband systems to enable full …
Keyword Guided Target Speech Recognition
This letter presents a new target speech recognition problem, where the target speech is
defined by a keyword. For instance, when a person speaks “Hey Google” or “Help Me”, we …
Leveraging real conversational data for multi-channel continuous speech separation
Existing multi-channel continuous speech separation (CSS) models are heavily dependent
on supervised data: either simulated data, which causes data mismatch between the training …