Label-anticipated event disentanglement for audio-visual video parsing

J Zhou, D Guo, Y Mao, Y Zhong, X Chang… - European Conference on …, 2024 - Springer
Audio-Visual Video Parsing (AVVP) task aims to detect and temporally locate
events within audio and visual modalities. Multiple events can overlap in the timeline …

Category-adaptive label discovery and noise rejection for multi-label recognition with partial positive labels

T Pu, Q Lao, H Wu, T Chen, L Tian… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
As a cost-effective alternative to standard multi-label learning, the multi-label image
recognition with partial positive labels (MLR-PPL) task attracts increasing attention, in which …

Weakly-Supervised Audio-Visual Video Parsing with Prototype-based Pseudo-Labeling

KK Rachavarapu… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
In this paper we address the weakly-supervised Audio-Visual Video Parsing (AVVP)
problem which aims at labeling events in a video as audible, visible, or both and temporally …

Resisting Noise in Pseudo Labels: Audible Video Event Parsing With Evidential Learning

X Jiang, X Xu, L Zhu, Z Sun, A Cichocki… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
Perceiving temporal events and discriminating their modality types in audible videos, which
is also called audio–visual video parsing (AVVP), is becoming a research hotspot in …

Segment-level event perception with semantic dictionary for weakly supervised audio-visual video parsing

Z Xie, Y Yang, Y Yu, J Wang, Y Liu, Y Jiang - Knowledge-Based Systems, 2025 - Elsevier
Videos capture auditory and visual signals, each conveying distinct events. Simultaneously
analyzing these multimodal signals enhances human comprehension of the video content …

Boosting Audio Visual Question Answering via Key Semantic-Aware Cues

G Li, H Du, D Hu - Proceedings of the 32nd ACM International …, 2024 - dl.acm.org
The Audio Visual Question Answering (AVQA) task aims to answer questions related to
various visual objects, sounds, and their interactions in videos. Such naturally multimodal …

SaSR-Net: Source-Aware Semantic Representation Network for Enhancing Audio-Visual Question Answering

T Yang, Y Nan, L Dai, Z Liang, Y Tian… - arXiv preprint arXiv …, 2024 - arxiv.org
Audio-Visual Question Answering (AVQA) is a challenging task that involves answering
questions based on both auditory and visual information in videos. A significant challenge is …

UniAV: Unified Audio-Visual Perception for Multi-Task Video Event Localization

T Geng, T Wang, Y Zhang, J Duan, W Guan… - arXiv preprint arXiv …, 2024 - arxiv.org
Video localization tasks aim to temporally locate specific instances in videos, including
temporal action localization (TAL), sound event detection (SED) and audio-visual event …

Reinforced Label Denoising for Weakly-Supervised Audio-Visual Video Parsing

Y Gao, X Sun, G Lv, D Yu, S Niu - arXiv preprint arXiv:2412.19563, 2024 - arxiv.org
Audio-visual video parsing (AVVP) aims to recognize audio and visual event labels with
precise temporal boundaries, which is quite challenging since audio or visual modality might …