Vision transformers are parameter-efficient audio-visual learners
Vision transformers (ViTs) have achieved impressive results on various computer vision
tasks in the last several years. In this work, we study the capability of frozen ViTs, pretrained …
Audio-visual segmentation
We propose to explore a new problem called audio-visual segmentation (AVS), in which the
goal is to output a pixel-level map of the object(s) that produce sound at the time of the …
Temporal sentence grounding in videos: A survey and future directions
Temporal sentence grounding in videos (TSGV), a.k.a. natural language video localization
(NLVL) or video moment retrieval (VMR), aims to retrieve a temporal moment that …
Multimodal variational auto-encoder based audio-visual segmentation
We propose an Explicit Conditional Multimodal Variational Auto-Encoder
(ECMVAE) for audio-visual segmentation (AVS), aiming to segment sound sources in the …
CATR: Combinatorial-dependence audio-queried transformer for audio-visual video segmentation
Audio-visual video segmentation (AVVS) aims to generate pixel-level maps of sound-
producing objects within image frames and ensure the maps faithfully adhere to the given …
Collecting cross-modal presence-absence evidence for weakly-supervised audio-visual event perception
With only video-level event labels, this paper targets the task of weakly-supervised audio-
visual event perception (WS-AVEP), which aims to temporally localize and categorize events …
Positive sample propagation along the audio-visual event line
Visual and audio signals often coexist in natural environments, forming audio-visual events
(AVEs). Given a video, we aim to localize video segments containing an AVE and identify its …
Audio-visual generalised zero-shot learning with cross-modal attention and language
Learning to classify video data from classes not included in the training data, i.e., video-based
zero-shot learning, is challenging. We conjecture that the natural alignment between the …
Contrastive positive sample propagation along the audio-visual event line
Visual and audio signals often coexist in natural environments, forming audio-visual events
(AVEs). Given a video, we aim to localize video segments containing an AVE and identify its …
Cross-modal background suppression for audio-visual event localization
Audiovisual Event (AVE) localization requires the model to jointly localize an event by
observing audio and visual information. However, in unconstrained videos, both information …