A Study of Dropout-Induced Modality Bias on Robustness to Missing Video Frames for Audio-Visual Speech Recognition

Y Dai, H Chen, J Du, R Wang, S Chen… - Proceedings of the …, 2024 - openaccess.thecvf.com
Abstract: Advanced Audio-Visual Speech Recognition (AVSR) systems have been observed
to be sensitive to missing video frames, performing even worse than single-modality models …

VoxBlink: A large-scale speaker verification dataset on camera

Y Lin, X Qin, G Zhao, M Cheng, N Jiang… - ICASSP 2024-2024 …, 2024 - ieeexplore.ieee.org
In this paper, we introduce a large-scale and high-quality audiovisual speaker verification
dataset, named VoxBlink. We propose an innovative and robust automatic audio-visual data …

VoxBlink2: A 100K+ speaker recognition corpus and the open-set speaker-identification benchmark

Y Lin, M Cheng, F Zhang, Y Gao, S Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
In this paper, we provide a large audio-visual speaker recognition dataset, VoxBlink2, which
includes approximately 10M utterances with videos from 110K+ speakers in the wild. This …

Multi-input multi-output target-speaker voice activity detection for unified, flexible, and robust audio-visual speaker diarization

M Cheng, M Li - arXiv preprint arXiv:2401.08052, 2024 - arxiv.org
Audio-visual learning has demonstrated promising results in many classical speech tasks
(e.g., speech separation, automatic speech recognition, wake-word spotting). We believe that …

The DKU-MSXF diarization system for the VoxCeleb Speaker Recognition Challenge 2023

M Cheng, W Wang, X Qin, Y Lin, N Jiang… - National Conference on …, 2023 - Springer
This paper describes the DKU-MSXF submission to track 4 of the VoxCeleb Speaker
Recognition Challenge 2023 (VoxSRC-23). Our system pipeline contains voice activity …

Sequence-to-Sequence Neural Diarization with Automatic Speaker Detection and Representation

M Cheng, Y Lin, M Li - arXiv preprint arXiv:2411.13849, 2024 - arxiv.org
This paper proposes a novel Sequence-to-Sequence Neural Diarization (SSND) framework
to perform online and offline speaker diarization. It is developed from the sequence-to …

Summary on the multimodal information based speech processing (MISP) 2022 challenge

H Chen, S Wu, Y Dai, Z Wang, J Du… - ICASSP 2023-2023 …, 2023 - ieeexplore.ieee.org
The Multimodal Information based Speech Processing (MISP) 2022 challenge aimed to
enhance speech processing performance in harsh acoustic environments by leveraging …

Joint Training or Not: An Exploration of Pre-trained Speech Models in Audio-Visual Speaker Diarization

H Zhao, L Zhang, Y Li, Y Wang, H Wang, W Rao… - National Conference on …, 2023 - Springer
The scarcity of labeled audio-visual datasets is a constraint for training superior audio-visual
speaker diarization systems. To improve the performance of audio-visual speaker …