Transformers in vision: A survey

S Khan, M Naseer, M Hayat, SW Zamir… - ACM computing …, 2022 - dl.acm.org
Astounding results from Transformer models on natural language tasks have intrigued the
vision community to study their application to computer vision problems. Among their salient …

Self-supervised learning for videos: A survey

MC Schiappa, YS Rawat, M Shah - ACM Computing Surveys, 2023 - dl.acm.org
The remarkable success of deep learning in various domains relies on the availability of
large-scale annotated datasets. However, obtaining annotations is expensive and requires …

Spatiotemporal contrastive video representation learning

R Qian, T Meng, B Gong, MH Yang… - Proceedings of the …, 2021 - openaccess.thecvf.com
We present a self-supervised Contrastive Video Representation Learning (CVRL) method to
learn spatiotemporal visual representations from unlabeled videos. Our representations are …

Self-supervised co-training for video representation learning

T Han, W Xie, A Zisserman - Advances in neural information …, 2020 - proceedings.neurips.cc
The objective of this paper is visual-only self-supervised video representation learning. We
make the following contributions: (i) we investigate the benefit of adding semantic-class …

VideoMoCo: Contrastive video representation learning with temporally adversarial examples

T Pan, Y Song, T Yang, W Jiang… - Proceedings of the …, 2021 - openaccess.thecvf.com
MoCo is effective for unsupervised image representation learning. In this paper, we propose
VideoMoCo for unsupervised video representation learning. Given a video sequence as an …

Self-supervised multimodal versatile networks

JB Alayrac, A Recasens, R Schneider… - Advances in neural …, 2020 - proceedings.neurips.cc
Videos are a rich source of multi-modal supervision. In this work, we learn representations
using self-supervision by leveraging three modalities naturally present in videos: visual …

Data-efficient image recognition with contrastive predictive coding

O Henaff - International conference on machine learning, 2020 - proceedings.mlr.press
Human observers can learn to recognize new categories of images from a handful of
examples, yet doing so with artificial ones remains an open challenge. We hypothesize that …

Self-supervised visual feature learning with deep neural networks: A survey

L Jing, Y Tian - IEEE transactions on pattern analysis and …, 2020 - ieeexplore.ieee.org
Large-scale labeled data are generally required to train deep neural networks in order to
obtain better performance in visual feature learning from images or videos for computer …

Self-supervised video representation learning by pace prediction

J Wang, J Jiao, YH Liu - Computer Vision–ECCV 2020: 16th European …, 2020 - Springer
This paper addresses the problem of self-supervised video representation learning from a
new perspective – by video pace prediction. It stems from the observation that human visual …

Self-supervised learning by cross-modal audio-video clustering

H Alwassel, D Mahajan, B Korbar… - Advances in …, 2020 - proceedings.neurips.cc
Visual and audio modalities are highly correlated, yet they contain different information.
Their strong correlation makes it possible to predict the semantics of one from the other with …