Transformers in vision: A survey
Astounding results from Transformer models on natural language tasks have intrigued the
vision community to study their application to computer vision problems. Among their salient …
Self-supervised learning for videos: A survey
The remarkable success of deep learning in various domains relies on the availability of
large-scale annotated datasets. However, obtaining annotations is expensive and requires …
Spatiotemporal contrastive video representation learning
We present a self-supervised Contrastive Video Representation Learning (CVRL) method to
learn spatiotemporal visual representations from unlabeled videos. Our representations are …
Self-supervised co-training for video representation learning
The objective of this paper is visual-only self-supervised video representation learning. We
make the following contributions: (i) we investigate the benefit of adding semantic-class …
VideoMoCo: Contrastive video representation learning with temporally adversarial examples
MoCo is effective for unsupervised image representation learning. In this paper, we propose
VideoMoCo for unsupervised video representation learning. Given a video sequence as an …
Self-supervised multimodal versatile networks
Videos are a rich source of multi-modal supervision. In this work, we learn representations
using self-supervision by leveraging three modalities naturally present in videos: visual …
Data-efficient image recognition with contrastive predictive coding
O Henaff - International conference on machine learning, 2020 - proceedings.mlr.press
Human observers can learn to recognize new categories of images from a handful of
examples, yet doing so with artificial ones remains an open challenge. We hypothesize that …
Self-supervised visual feature learning with deep neural networks: A survey
Large-scale labeled data are generally required to train deep neural networks in order to
obtain better performance in visual feature learning from images or videos for computer …
Self-supervised video representation learning by pace prediction
This paper addresses the problem of self-supervised video representation learning from a
new perspective – by video pace prediction. It stems from the observation that human visual …
Self-supervised learning by cross-modal audio-video clustering
Visual and audio modalities are highly correlated, yet they contain different information.
Their strong correlation makes it possible to predict the semantics of one from the other with …