Is someone speaking? exploring long-term temporal features for audio-visual active speaker detection
Active speaker detection (ASD) seeks to detect who is speaking in a visual scene of one or
more speakers. The successful ASD depends on accurate interpretation of short-term and …
A comprehensive survey on video saliency detection with auditory information: the audio-visual consistency perceptual is the key!
Video saliency detection (VSD) aims at fast locating the most attractive
objects/things/patterns in a given video clip. Existing VSD-related works have mainly relied …
A light weight model for active speaker detection
Active speaker detection is a challenging task in audio-visual scenarios, with the aim to
detect who is speaking in one or more speaker scenarios. This task has received …
LoCoNet: Long-short context network for active speaker detection
Active Speaker Detection (ASD) aims to identify who is speaking in each frame of a
video. Solving ASD involves using audio and visual information in two complementary …
SD-NeRF: Towards lifelike talking head animation via spatially-adaptive dual-driven NeRFs
Recent years have witnessed great progress in audio-driven talking head animation. Among
these methods, the 3D-based ones better preserve the 3D consistency of the generated …
Multi-modal perception attention network with self-supervised learning for audio-visual speaker tracking
Multi-modal fusion is proven to be an effective method to improve the accuracy and
robustness of speaker tracking, especially in complex scenarios. However, how to combine …
Speaker recognition with two-step multi-modal deep cleansing
Neural network-based speaker recognition has achieved significant improvement in recent
years. A robust speaker representation learns meaningful knowledge from both hard and …
Deep audio-visual beamforming for speaker localization
Generalized Cross Correlation (GCC) is the most popular localization technique over the
past decades and can be extended with the beamforming method, e.g. Steered Response …
Multi-stage Face-voice Association Learning with Keynote Speaker Diarization
The human brain has the capability to associate the unknown person's voice and face by
leveraging their general relationship, referred to as "cross-modal speaker verification". This …
Audiovisual Tracking of Multiple Speakers in Smart Spaces
This paper presents GAVT, a highly accurate audiovisual 3D tracking system based on
particle filters and a probabilistic framework, employing a single camera and a microphone …