Vision transformers are parameter-efficient audio-visual learners
Vision transformers (ViTs) have achieved impressive results on various computer vision
tasks in the last several years. In this work, we study the capability of frozen ViTs, pretrained …
A joint cross-attention model for audio-visual fusion in dimensional emotion recognition
Multi-modal emotion recognition has recently gained much attention since it can leverage
diverse and complementary relationships over multiple modalities, such as audio, visual …
Annotation-free audio-visual segmentation
The objective of Audio-Visual Segmentation (AVS) is to localise the sounding
objects within visual scenes by accurately predicting pixel-wise segmentation masks. To …
Collecting cross-modal presence-absence evidence for weakly-supervised audio-visual event perception
With only video-level event labels, this paper targets the task of weakly-supervised
audio-visual event perception (WS-AVEP), which aims to temporally localize and categorize events …
Temporal action localization in the deep learning era: A survey
The temporal action localization research aims to discover action instances from untrimmed
videos, representing a fundamental step in the field of intelligent video understanding. With …
Boosting weakly-supervised temporal action localization with text information
Due to the lack of temporal annotation, current Weakly-supervised Temporal Action
Localization (WTAL) methods are generally stuck into over-complete or incomplete …
Learning action completeness from points for weakly-supervised temporal action localization
We tackle the problem of localizing temporal intervals of actions with only a single frame
label for each action instance for training. Owing to label sparsity, existing work fails to learn …
Exploring cross-video and cross-modality signals for weakly-supervised audio-visual video parsing
The audio-visual video parsing task aims to temporally parse a video into audio or visual
event categories. However, it is labor intensive to temporally annotate audio and visual …
Audio-visual segmentation via unlabeled frame exploitation
Audio-visual segmentation (AVS) aims to segment the sounding objects in video frames.
Although great progress has been witnessed, we experimentally reveal that current methods …
Audio-adaptive activity recognition across video domains
This paper strives for activity recognition under domain shift, for example caused by change
of scenery or camera viewpoint. The leading approaches reduce the shift in activity …