Learning in audio-visual context: A review, analysis, and new perspective

Y Wei, D Hu, Y Tian, X Li - arXiv preprint arXiv:2208.09579, 2022 - arxiv.org
Sight and hearing are two senses that play a vital role in human communication and scene
understanding. To mimic human perception ability, audio-visual learning, aimed at …

Self-supervised multimodal learning: A survey

Y Zong, O Mac Aodha, T Hospedales - arXiv preprint arXiv:2304.01008, 2023 - arxiv.org
Multimodal learning, which aims to understand and analyze information from multiple
modalities, has achieved substantial progress in the supervised regime in recent years …

Learning to answer questions in dynamic audio-visual scenarios

G Li, Y Wei, Y Tian, C Xu, JR Wen… - Proceedings of the …, 2022 - openaccess.thecvf.com
In this paper, we focus on the Audio-Visual Question Answering (AVQA) task, which aims to
answer questions regarding different visual objects, sounds, and their associations in …

A light weight model for active speaker detection

J Liao, H Duan, K Feng, W Zhao… - Proceedings of the …, 2023 - openaccess.thecvf.com
Active speaker detection is a challenging task in audio-visual scenarios, aiming to detect who is speaking in scenarios with one or more speakers. This task has received …

Annotation-free audio-visual segmentation

J Liu, Y Wang, C Ju, C Ma… - Proceedings of the …, 2024 - openaccess.thecvf.com
The objective of Audio-Visual Segmentation (AVS) is to localise the sounding
objects within visual scenes by accurately predicting pixel-wise segmentation masks. To …

Audio-visual segmentation via unlabeled frame exploitation

J Liu, Y Liu, F Zhang, C Ju… - Proceedings of the …, 2024 - openaccess.thecvf.com
Audio-visual segmentation (AVS) aims to segment the sounding objects in video frames.
Although great progress has been witnessed, we experimentally reveal that current methods …

Progressive spatio-temporal perception for audio-visual question answering

G Li, W Hou, D Hu - Proceedings of the 31st ACM international …, 2023 - dl.acm.org
The Audio-Visual Question Answering (AVQA) task aims to answer questions about different
visual objects, sounds, and their associations in videos. Such naturally multi-modal videos …

Egocentric auditory attention localization in conversations

F Ryan, H Jiang, A Shukla… - Proceedings of the …, 2023 - openaccess.thecvf.com
In a noisy conversation environment such as a dinner party, people often exhibit selective
auditory attention, or the ability to focus on a particular speaker while tuning out others …

Prompting segmentation with sound is generalizable audio-visual source localizer

Y Wang, W Liu, G Li, J Ding, D Hu, X Li - Proceedings of the AAAI …, 2024 - ojs.aaai.org
Never having seen an object and heard its sound simultaneously, can the model still
accurately localize its visual position from the input audio? In this work, we concentrate on …

Semantic and relation modulation for audio-visual event localization

H Wang, ZJ Zha, L Li, X Chen… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
We study the problem of localizing audio-visual events that are both audible and visible in a
video. Existing works focus on encoding and aligning audio and visual features at the …