Audio-visual instance segmentation

R Guo, X Ying, Y Chen, D Niu, G Li, L Qu, Y Qi… - arXiv preprint arXiv …, 2023 - arxiv.org
In this paper, we propose a new multi-modal task, termed audio-visual instance
segmentation (AVIS), which aims to simultaneously identify, segment and track individual …

Toward Long Form Audio-Visual Video Understanding

W Hou, G Li, Y Tian, D Hu - ACM Transactions on Multimedia Computing …, 2024 - dl.acm.org
We live in a world filled with never-ending streams of multimodal information. As a more
natural recording of the real scenario, long form audio-visual videos (LFAVs) are expected …

Boosting Audio Visual Question Answering via Key Semantic-Aware Cues

G Li, H Du, D Hu - Proceedings of the 32nd ACM International …, 2024 - dl.acm.org
The Audio Visual Question Answering (AVQA) task aims to answer questions related to
various visual objects, sounds, and their interactions in videos. Such naturally multimodal …

LINK: Adaptive Modality Interaction for Audio-Visual Video Parsing

L Wang, B Zhu, Y Chen, J Wang - arXiv preprint arXiv:2412.20872, 2024 - arxiv.org
Audio-visual video parsing focuses on classifying videos through weak labels while
identifying events as either visible, audible, or both, alongside their respective temporal …