Is someone speaking? Exploring long-term temporal features for audio-visual active speaker detection

R Tao, Z Pan, RK Das, X Qian, MZ Shou… - Proceedings of the 29th …, 2021 - dl.acm.org
Active speaker detection (ASD) seeks to detect who is speaking in a visual scene of one or
more speakers. The successful ASD depends on accurate interpretation of short-term and …

A comprehensive survey on video saliency detection with auditory information: the audio-visual consistency perceptual is the key!

C Chen, M Song, W Song, L Guo… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
Video saliency detection (VSD) aims at fast locating the most attractive
objects/things/patterns in a given video clip. Existing VSD-related works have mainly relied …

A light weight model for active speaker detection

J Liao, H Duan, K Feng, W Zhao… - Proceedings of the …, 2023 - openaccess.thecvf.com
Active speaker detection is a challenging task in audio-visual scenarios, with the aim to
detect who is speaking in one or more speaker scenarios. This task has received …

LoCoNet: Long-short context network for active speaker detection

X Wang, F Cheng, G Bertasius - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Active Speaker Detection (ASD) aims to identify who is speaking in each frame of a
video. Solving ASD involves using audio and visual information in two complementary …

SD-NeRF: Towards lifelike talking head animation via spatially-adaptive dual-driven NeRFs

S Shen, W Li, X Huang, Z Zhu… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Recent years have witnessed great progress in audio-driven talking head animation. Among
these methods, the 3D-based ones better preserve the 3D consistency of the generated …

Multi-modal perception attention network with self-supervised learning for audio-visual speaker tracking

Y Li, H Liu, H Tang - Proceedings of the AAAI Conference on Artificial …, 2022 - ojs.aaai.org
Multi-modal fusion is proven to be an effective method to improve the accuracy and
robustness of speaker tracking, especially in complex scenarios. However, how to combine …

Speaker recognition with two-step multi-modal deep cleansing

R Tao, KA Lee, Z Shi, H Li - ICASSP 2023-2023 IEEE …, 2023 - ieeexplore.ieee.org
Neural network-based speaker recognition has achieved significant improvement in recent
years. A robust speaker representation learns meaningful knowledge from both hard and …

Deep audio-visual beamforming for speaker localization

X Qian, Q Zhang, G Guan, W Xue - IEEE Signal Processing …, 2022 - ieeexplore.ieee.org
Generalized Cross Correlation (GCC) is the most popular localization technique over the
past decades and can be extended with the beamforming method, e.g., Steered Response …

Multi-stage Face-voice Association Learning with Keynote Speaker Diarization

R Tao, Z Shi, Y Jiang, DT Truong, ES Chng… - Proceedings of the …, 2024 - dl.acm.org
The human brain has the capability to associate the unknown person's voice and face by
leveraging their general relationship, referred to as "cross-modal speaker verification". This …

Audiovisual Tracking of Multiple Speakers in Smart Spaces

F Sanabria-Macias, M Marron-Romera… - Sensors, 2023 - mdpi.com
This paper presents GAVT, a highly accurate audiovisual 3D tracking system based on
particle filters and a probabilistic framework, employing a single camera and a microphone …