Masked autoencoders that listen
This paper studies a simple extension of image-based Masked Autoencoders (MAE) to self-supervised representation learning from audio spectrograms. Following the Transformer …
Mavil: Masked audio-video learners
We present Masked Audio-Video Learners (MAViL) to learn audio-visual representations with three complementary forms of self-supervision: (1) reconstructing …
Video transformers: A survey
Transformer models have shown great success handling long-range interactions, making them a promising tool for modeling video. However, they lack inductive biases and scale …
Multi-modal learning with missing modality via shared-specific feature modelling
The missing modality issue is critical but non-trivial for multi-modal models to solve. Current methods aiming to handle the missing modality problem in multi-modal tasks either …
Probabilistic representations for video contrastive learning
This paper presents Probabilistic Video Contrastive Learning, a self-supervised representation learning method that bridges contrastive learning with probabilistic …
Visual acoustic matching
We introduce the visual acoustic matching task, in which an audio clip is transformed to sound like it was recorded in a target environment. Given an image of the target environment …
Self-supervised audio-visual soundscape stylization
Speech sounds convey a great deal of information about the scenes in which they occur, resulting in a variety of effects ranging from reverberation to additional ambient sounds. In this paper, we …
Learning long-term spatial-temporal graphs for active speaker detection
Active speaker detection (ASD) in videos with multiple speakers is a challenging task, as it requires learning effective audiovisual features and spatial-temporal correlations over long …
Semi-supervised temporal action detection with proposal-free masking
Existing temporal action detection (TAD) methods rely on a large amount of training data with segment-level annotations. Collecting and annotating such a training set is thus highly …
Motion sensitive contrastive learning for self-supervised video representation
Contrastive learning has shown great potential in video representation learning. However, existing approaches fail to sufficiently exploit short-term motion dynamics, which are crucial …