Masked autoencoders that listen

PY Huang, H Xu, J Li, A Baevski… - Advances in …, 2022 - proceedings.neurips.cc
This paper studies a simple extension of image-based Masked Autoencoders (MAE) to self-
supervised representation learning from audio spectrograms. Following the Transformer …

MAViL: Masked audio-video learners

PY Huang, V Sharma, H Xu, C Ryali… - Advances in …, 2024 - proceedings.neurips.cc
We present Masked Audio-Video Learners (MAViL) to learn audio-visual
representations with three complementary forms of self-supervision: (1) reconstructing …

Video transformers: A survey

J Selva, AS Johansen, S Escalera… - … on Pattern Analysis …, 2023 - ieeexplore.ieee.org
Transformer models have shown great success handling long-range interactions, making
them a promising tool for modeling video. However, they lack inductive biases and scale …

Multi-modal learning with missing modality via shared-specific feature modelling

H Wang, Y Chen, C Ma, J Avery… - Proceedings of the …, 2023 - openaccess.thecvf.com
The missing modality issue is critical but non-trivial to be solved by multi-modal models.
Current methods aiming to handle the missing modality problem in multi-modal tasks, either …

Probabilistic representations for video contrastive learning

J Park, J Lee, IJ Kim, K Sohn - Proceedings of the IEEE/CVF …, 2022 - openaccess.thecvf.com
This paper presents Probabilistic Video Contrastive Learning, a self-supervised
representation learning method that bridges contrastive learning with probabilistic …

Visual acoustic matching

C Chen, R Gao, P Calamia… - Proceedings of the …, 2022 - openaccess.thecvf.com
We introduce the visual acoustic matching task, in which an audio clip is transformed to
sound like it was recorded in a target environment. Given an image of the target environment …

Self-supervised audio-visual soundscape stylization

T Li, R Wang, PY Huang, A Owens… - … on Computer Vision, 2024 - Springer
Speech sounds convey a great deal of information about the scenes, resulting in a variety of
effects ranging from reverberation to additional ambient sounds. In this paper, we …

Learning long-term spatial-temporal graphs for active speaker detection

K Min, S Roy, S Tripathi, T Guha… - European Conference on …, 2022 - Springer
Active speaker detection (ASD) in videos with multiple speakers is a challenging task as it
requires learning effective audiovisual features and spatial-temporal correlations over long …

Semi-supervised temporal action detection with proposal-free masking

S Nag, X Zhu, YZ Song, T Xiang - European Conference on Computer …, 2022 - Springer
Existing temporal action detection (TAD) methods rely on a large number of training data
with segment-level annotations. Collecting and annotating such a training set is thus highly …

Motion sensitive contrastive learning for self-supervised video representation

J Ni, N Zhou, J Qin, Q Wu, J Liu, B Li… - European Conference on …, 2022 - Springer
Contrastive learning has shown great potential in video representation learning. However,
existing approaches fail to sufficiently exploit short-term motion dynamics, which are crucial …