Recent advances in end-to-end automatic speech recognition

J Li - APSIPA Transactions on Signal and Information …, 2022 - nowpublishers.com
Recently, the speech community is seeing a significant trend of moving from deep neural
network based hybrid modeling to end-to-end (E2E) modeling for automatic speech …

A review of speaker diarization: Recent advances with deep learning

TJ Park, N Kanda, D Dimitriadis, KJ Han… - Computer Speech & …, 2022 - Elsevier
Speaker diarization is a task to label audio or video recordings with classes that correspond
to speaker identity, or in short, a task to identify “who spoke when”. In the early years …

M2MeT: The ICASSP 2022 multi-channel multi-party meeting transcription challenge

F Yu, S Zhang, Y Fu, L Xie, S Zheng… - ICASSP 2022-2022 …, 2022 - ieeexplore.ieee.org
Recent development of speech signal processing, such as speech recognition, speaker
diarization, etc., has inspired numerous applications of speech technologies. The meeting …

Streaming multi-talker ASR with token-level serialized output training

N Kanda, J Wu, Y Wu, X Xiao, Z Meng, X Wang… - arXiv preprint arXiv …, 2022 - arxiv.org
This paper proposes a token-level serialized output training (t-SOT), a novel framework for
streaming multi-talker automatic speech recognition (ASR). Unlike existing streaming multi …

GPU-accelerated guided source separation for meeting transcription

D Raj, D Povey, S Khudanpur - arXiv preprint arXiv:2212.05271, 2022 - arxiv.org
Guided source separation (GSS) is a type of target-speaker extraction method that relies on
pre-computed speaker activities and blind source separation to perform front-end …

Joint speaker counting, speech recognition, and speaker identification for overlapped speech of any number of speakers

N Kanda, Y Gaur, X Wang, Z Meng, Z Chen… - arXiv preprint arXiv …, 2020 - arxiv.org
We propose an end-to-end speaker-attributed automatic speech recognition model that
unifies speaker counting, speech recognition, and speaker identification on monaural …

Automatic lyrics transcription of polyphonic music with lyrics-chord multi-task learning

X Gao, C Gupta, H Li - IEEE/ACM Transactions on Audio …, 2022 - ieeexplore.ieee.org
Lyrics are the words that make up a song, while chords are harmonic sets of multiple notes
in music. Lyrics and chords are generally essential information in music, i.e., unaccompanied …

One model to rule them all? Towards end-to-end joint speaker diarization and speech recognition

S Cornell, J Jung, S Watanabe… - ICASSP 2024-2024 …, 2024 - ieeexplore.ieee.org
This paper presents a novel framework for joint speaker diarization (SD) and automatic
speech recognition (ASR), named SLIDAR (sliding-window diarization-augmented …

CoVoMix: Advancing zero-shot speech generation for human-like multi-talker conversations

L Zhang, Y Qian, L Zhou, S Liu… - Advances in …, 2025 - proceedings.neurips.cc
Recent advancements in zero-shot text-to-speech (TTS) modeling have led to significant
strides in generating high-fidelity and diverse speech. However, dialogue generation, along …

Extending Whisper with prompt tuning to target-speaker ASR

H Ma, Z Peng, M Shao, J Li, J Liu - ICASSP 2024-2024 IEEE …, 2024 - ieeexplore.ieee.org
Target-speaker automatic speech recognition (ASR) aims to transcribe the desired speech
of a target speaker from multi-talker overlapped utterances. Most of the existing target …