Vision transformers are parameter-efficient audio-visual learners

YB Lin, YL Sung, J Lei, M Bansal… - Proceedings of the …, 2023 - openaccess.thecvf.com
Vision transformers (ViTs) have achieved impressive results on various computer vision
tasks in the last several years. In this work, we study the capability of frozen ViTs, pretrained …

A joint cross-attention model for audio-visual fusion in dimensional emotion recognition

RG Praveen, WC de Melo, N Ullah… - Proceedings of the …, 2022 - openaccess.thecvf.com
Multi-modal emotion recognition has recently gained much attention since it can leverage
diverse and complementary relationships over multiple modalities, such as audio, visual …

Annotation-free audio-visual segmentation

J Liu, Y Wang, C Ju, C Ma… - Proceedings of the …, 2024 - openaccess.thecvf.com
The objective of Audio-Visual Segmentation (AVS) is to localise the sounding
objects within visual scenes by accurately predicting pixel-wise segmentation masks. To …

Collecting cross-modal presence-absence evidence for weakly-supervised audio-visual event perception

J Gao, M Chen, C Xu - … of the IEEE/CVF conference on …, 2023 - openaccess.thecvf.com
With only video-level event labels, this paper targets at the task of weakly-supervised audio-
visual event perception (WS-AVEP), which aims to temporally localize and categorize events …

Temporal action localization in the deep learning era: A survey

B Wang, Y Zhao, L Yang, T Long… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
The temporal action localization research aims to discover action instances from untrimmed
videos, representing a fundamental step in the field of intelligent video understanding. With …

Boosting weakly-supervised temporal action localization with text information

G Li, D Cheng, X Ding, N Wang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Due to the lack of temporal annotation, current Weakly-supervised Temporal Action
Localization (WTAL) methods are generally stuck into over-complete or incomplete …

Learning action completeness from points for weakly-supervised temporal action localization

P Lee, H Byun - Proceedings of the IEEE/CVF international …, 2021 - openaccess.thecvf.com
We tackle the problem of localizing temporal intervals of actions with only a single frame
label for each action instance for training. Owing to label sparsity, existing work fails to learn …

Exploring cross-video and cross-modality signals for weakly-supervised audio-visual video parsing

YB Lin, HY Tseng, HY Lee, YY Lin… - Advances in Neural …, 2021 - proceedings.neurips.cc
The audio-visual video parsing task aims to temporally parse a video into audio or visual
event categories. However, it is labor intensive to temporally annotate audio and visual …

Audio-visual segmentation via unlabeled frame exploitation

J Liu, Y Liu, F Zhang, C Ju… - Proceedings of the …, 2024 - openaccess.thecvf.com
Audio-visual segmentation (AVS) aims to segment the sounding objects in video frames.
Although great progress has been witnessed, we experimentally reveal that current methods …

Audio-adaptive activity recognition across video domains

Y Zhang, H Doughty, L Shao… - Proceedings of the …, 2022 - openaccess.thecvf.com
This paper strives for activity recognition under domain shift, for example caused by change
of scenery or camera viewpoint. The leading approaches reduce the shift in activity …