A light weight model for active speaker detection

J Liao, H Duan, K Feng, W Zhao… - Proceedings of the …, 2023 - openaccess.thecvf.com
Active speaker detection is a challenging task in audio-visual scenarios, with the aim to
detect who is speaking in one or more speaker scenarios. This task has received …

Egocentric auditory attention localization in conversations

F Ryan, H Jiang, A Shukla… - Proceedings of the …, 2023 - openaccess.thecvf.com
In a noisy conversation environment such as a dinner party, people often exhibit selective
auditory attention, or the ability to focus on a particular speaker while tuning out others …

Target active speaker detection with audio-visual cues

Y Jiang, R Tao, Z Pan, H Li - arXiv preprint arXiv:2305.12831, 2023 - arxiv.org
In active speaker detection (ASD), we would like to detect whether an on-screen person is
speaking based on audio-visual cues. Previous studies have primarily focused on modeling …

LoCoNet: Long-short context network for active speaker detection

X Wang, F Cheng, G Bertasius - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Active Speaker Detection (ASD) aims to identify who is speaking in each frame of a
video. Solving ASD involves using audio and visual information in two complementary …

Dr2Net: Dynamic Reversible Dual-Residual Networks for Memory-Efficient Finetuning

C Zhao, S Liu, K Mangalam, G Qian… - Proceedings of the …, 2024 - openaccess.thecvf.com
Large pretrained models are increasingly crucial in modern computer vision tasks. These
models are typically used in downstream tasks by end-to-end finetuning which is highly …

EgoLoc: Revisiting 3D object localization from egocentric videos with visual queries

J Mai, A Hamdi, S Giancola, C Zhao… - Proceedings of the …, 2023 - openaccess.thecvf.com
With the recent advances in video and 3D understanding, novel 4D spatio-temporal
methods fusing both concepts have emerged. Towards this direction, the Ego4D Episodic …

Listen to look into the future: Audio-visual egocentric gaze anticipation

B Lai, F Ryan, W Jia, M Liu, JM Rehg - European Conference on Computer …, 2024 - Springer
Egocentric gaze anticipation serves as a key building block for the emerging capability of
Augmented Reality. Notably, gaze behavior is driven by both visual cues and audio signals …

Joint audio-visual idling vehicle detection with streamlined input dependencies

X Li, R Mohammed, T Mangin, S Saha… - arXiv preprint arXiv …, 2024 - arxiv.org
Idling vehicle detection (IVD) can be helpful in monitoring and reducing unnecessary idling
and can be integrated into real-time systems to address the resulting pollution and harmful …

BIAS: A Body-based Interpretable Active Speaker Approach

T Roxo, JC Costa, PRM Inácio… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
State-of-the-art Active Speaker Detection (ASD) approaches heavily rely on audio and facial
features to perform, which is not a sustainable approach in wild scenarios. Although these …

Audio-Visual Speaker Diarization: Current Databases, Approaches and Challenges

V Mingote, A Ortega, A Miguel, E Lleida - arXiv preprint arXiv:2409.05659, 2024 - arxiv.org
Nowadays, the large amount of audio-visual content available has fostered the need to
develop new robust automatic speaker diarization systems to analyse and characterise it …