Recent advances in end-to-end automatic speech recognition
J Li - APSIPA Transactions on Signal and Information …, 2022 - nowpublishers.com
Recently, the speech community is seeing a significant trend of moving from deep neural
network based hybrid modeling to end-to-end (E2E) modeling for automatic speech …
A review of speaker diarization: Recent advances with deep learning
Speaker diarization is a task to label audio or video recordings with classes that correspond
to speaker identity, or in short, a task to identify “who spoke when”. In the early years …
M2MeT: The ICASSP 2022 multi-channel multi-party meeting transcription challenge
Recent development of speech signal processing, such as speech recognition, speaker
diarization, etc., has inspired numerous applications of speech technologies. The meeting …
Streaming multi-talker ASR with token-level serialized output training
This paper proposes a token-level serialized output training (t-SOT), a novel framework for
streaming multi-talker automatic speech recognition (ASR). Unlike existing streaming multi …
GPU-accelerated guided source separation for meeting transcription
Guided source separation (GSS) is a type of target-speaker extraction method that relies on
pre-computed speaker activities and blind source separation to perform front-end …
Joint speaker counting, speech recognition, and speaker identification for overlapped speech of any number of speakers
We propose an end-to-end speaker-attributed automatic speech recognition model that
unifies speaker counting, speech recognition, and speaker identification on monaural …
Automatic lyrics transcription of polyphonic music with lyrics-chord multi-task learning
Lyrics are the words that make up a song, while chords are harmonic sets of multiple notes
in music. Lyrics and chords are generally essential information in music, i.e., unaccompanied …
One model to rule them all? Towards end-to-end joint speaker diarization and speech recognition
This paper presents a novel framework for joint speaker diarization (SD) and automatic
speech recognition (ASR), named SLIDAR (sliding-window diarization-augmented …
CoVoMix: Advancing zero-shot speech generation for human-like multi-talker conversations
Recent advancements in zero-shot text-to-speech (TTS) modeling have led to significant
strides in generating high-fidelity and diverse speech. However, dialogue generation, along …
Extending Whisper with prompt tuning to target-speaker ASR
Target-speaker automatic speech recognition (ASR) aims to transcribe the desired speech
of a target speaker from multi-talker overlapped utterances. Most of the existing target …