A review of speaker diarization: Recent advances with deep learning

TJ Park, N Kanda, D Dimitriadis, KJ Han… - Computer Speech & …, 2022 - Elsevier
Speaker diarization is a task to label audio or video recordings with classes that correspond
to speaker identity, or in short, a task to identify “who spoke when”. In the early years …

Survey of deep learning paradigms for speech processing

KB Bhangale, M Kothandaraman - Wireless Personal Communications, 2022 - Springer
Over the past decades, a particular focus is given to research on machine learning
techniques for speech processing applications. However, in the past few years, research …

The CHiME-7 DASR challenge: Distant meeting transcription with multiple devices in diverse scenarios

S Cornell, M Wiesner, S Watanabe, D Raj… - arXiv preprint arXiv …, 2023 - arxiv.org
The CHiME challenges have played a significant role in the development and evaluation of
robust automatic speech recognition (ASR) systems. We introduce the CHiME-7 distant ASR …

Streaming multi-talker ASR with token-level serialized output training

N Kanda, J Wu, Y Wu, X Xiao, Z Meng, X Wang… - arXiv preprint arXiv …, 2022 - arxiv.org
This paper proposes a token-level serialized output training (t-SOT), a novel framework for
streaming multi-talker automatic speech recognition (ASR). Unlike existing streaming multi …

Attention-based encoder-decoder end-to-end neural diarization with embedding enhancer

Z Chen, B Han, S Wang, Y Qian - IEEE/ACM Transactions on …, 2024 - ieeexplore.ieee.org
Deep neural network-based systems have significantly improved the performance of
speaker diarization tasks. However, end-to-end neural diarization (EEND) systems often …

GPU-accelerated guided source separation for meeting transcription

D Raj, D Povey, S Khudanpur - arXiv preprint arXiv:2212.05271, 2022 - arxiv.org
Guided source separation (GSS) is a type of target-speaker extraction method that relies on
pre-computed speaker activities and blind source separation to perform front-end …

NOTSOFAR-1 challenge: New datasets, baseline, and tasks for distant meeting transcription

A Vinnikov, A Ivry, A Hurvitz, I Abramovski… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce the first Natural Office Talkers in Settings of Far-field Audio Recordings
(“NOTSOFAR-1”) Challenge alongside datasets and baseline system. The challenge …

One model to rule them all? Towards end-to-end joint speaker diarization and speech recognition

S Cornell, J Jung, S Watanabe… - ICASSP 2024-2024 …, 2024 - ieeexplore.ieee.org
This paper presents a novel framework for joint speaker diarization (SD) and automatic
speech recognition (ASR), named SLIDAR (sliding-window diarization-augmented …

On word error rate definitions and their efficient computation for multi-speaker speech recognition systems

T von Neumann, C Boeddeker… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org
We propose a general framework to compute the word error rate (WER) of ASR systems that
process recordings containing multiple speakers at their input and that produce multiple …

End-to-end speaker-attributed ASR with transformer

N Kanda, G Ye, Y Gaur, X Wang, Z Meng… - arXiv preprint arXiv …, 2021 - arxiv.org
This paper presents our recent effort on end-to-end speaker-attributed automatic speech
recognition, which jointly performs speaker counting, speech recognition and speaker …