A comprehensive survey on pretrained foundation models: A history from BERT to ChatGPT
Pretrained Foundation Models (PFMs) are regarded as the foundation for various downstream tasks across different data modalities. A PFM (e.g., BERT, ChatGPT, GPT-4) is …
A Survey on Self-supervised Learning: Algorithms, Applications, and Future Trends
Deep supervised learning algorithms typically require a large volume of labeled data to
achieve satisfactory performance. However, the process of collecting and labeling such data …
DINOv2: Learning robust visual features without supervision
The recent breakthroughs in natural language processing for model pretraining on large
quantities of data have opened the way for similar foundation models in computer vision …
CLIP2Scene: Towards label-efficient 3D scene understanding by CLIP
Contrastive Language-Image Pre-training (CLIP) achieves promising results in 2D
zero-shot and few-shot learning. Despite the impressive performance in 2D, applying CLIP …
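As a concrete illustration of the 2D zero-shot recognition this abstract refers to, the sketch below shows the usual CLIP-style recipe: embed an image and a set of class prompts, then pick the class whose text embedding is most similar to the image embedding. It is a minimal, hypothetical example rather than code from the paper; the tensor shapes, the 100.0 logit scale, and the random embeddings standing in for real encoder outputs are all assumptions.

```python
# Hypothetical sketch of CLIP-style zero-shot classification (not the paper's code).
# The random tensors below stand in for outputs of pretrained image/text encoders.
import torch
import torch.nn.functional as F

def zero_shot_classify(image_feat: torch.Tensor,
                       class_text_feats: torch.Tensor) -> torch.Tensor:
    """image_feat: (d,) embedding of one image; class_text_feats: (C, d) embeddings
    of prompts such as "a photo of a {class}". Returns per-class probabilities."""
    image_feat = F.normalize(image_feat, dim=-1)            # unit-norm image embedding
    class_text_feats = F.normalize(class_text_feats, dim=-1)
    logits = 100.0 * class_text_feats @ image_feat          # scaled cosine similarities
    return logits.softmax(dim=-1)                           # probabilities over class prompts

# Toy usage with random embeddings in place of real encoder outputs.
probs = zero_shot_classify(torch.randn(512), torch.randn(1000, 512))
print(probs.argmax().item())
```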
Masked feature prediction for self-supervised visual pre-training
We present Masked Feature Prediction (MaskFeat) for self-supervised pre-training
of video models. Our approach first randomly masks out a portion of the input sequence and …
SimMIM: A simple framework for masked image modeling
This paper presents SimMIM, a simple framework for masked image modeling. We have
simplified recently proposed relevant approaches, without the need for special designs …
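For readers unfamiliar with masked image modeling, the sketch below illustrates the general recipe that SimMIM-style methods simplify: randomly mask a fraction of the patch tokens, encode the corrupted sequence, and regress the original content of the masked patches only. The tiny transformer, the 0.6 mask ratio, and the L1-style reconstruction target are illustrative assumptions, not the paper's exact design.

```python
# Minimal masked-image-modeling sketch (illustrative, not SimMIM's actual architecture).
import torch
import torch.nn as nn

class TinyMIM(nn.Module):
    def __init__(self, patch_dim=768, mask_ratio=0.6):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.mask_token = nn.Parameter(torch.zeros(patch_dim))   # learned [MASK] embedding
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=patch_dim, nhead=8, batch_first=True),
            num_layers=2)
        self.head = nn.Linear(patch_dim, patch_dim)   # lightweight prediction head

    def forward(self, patches):                        # patches: (B, N, patch_dim)
        B, N, _ = patches.shape
        num_mask = int(self.mask_ratio * N)
        idx = torch.rand(B, N).argsort(dim=1)          # random permutation per image
        mask = torch.zeros(B, N, dtype=torch.bool)
        mask[torch.arange(B).unsqueeze(1), idx[:, :num_mask]] = True   # True = masked patch
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand(B, N, -1), patches)
        pred = self.head(self.encoder(x))
        # reconstruction loss is computed on the masked positions only
        return ((pred - patches).abs() * mask.unsqueeze(-1)).sum() / mask.sum()

loss = TinyMIM()(torch.randn(2, 196, 768))
loss.backward()
```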
Masked siamese networks for label-efficient learning
We propose Masked Siamese Networks (MSN), a self-supervised learning
framework for learning image representations. Our approach matches the representation of …
Context autoencoder for self-supervised representation learning
We present a novel masked image modeling (MIM) approach, context autoencoder (CAE),
for self-supervised representation pretraining. We pretrain an encoder by making predictions …
Extract free dense labels from CLIP
Contrastive Language-Image Pre-training (CLIP) has made a remarkable
breakthrough in open-vocabulary zero-shot image recognition. Many recent studies …
Vision-language pre-training with triple contrastive learning
Vision-language representation learning largely benefits from image-text alignment through
contrastive losses (e.g., InfoNCE loss). The success of this alignment strategy is attributed to …
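Since the abstract names the InfoNCE loss explicitly, a minimal sketch of the symmetric image-text version used in CLIP-style alignment may help: matching image-caption pairs sit on the diagonal of a similarity matrix and are treated as positives, while all other pairings act as negatives. The function name, batch size, and temperature of 0.07 below are illustrative assumptions, not the paper's exact settings.

```python
# Sketch of a symmetric image-text InfoNCE loss (illustrative, not the paper's code).
import torch
import torch.nn.functional as F

def info_nce(image_emb, text_emb, temperature=0.07):
    """image_emb, text_emb: (B, d) embeddings of B matching image-text pairs.
    The i-th image and i-th text are positives; all other pairings are negatives."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (B, B) cosine similarities
    targets = torch.arange(image_emb.size(0))         # positives lie on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)       # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)   # text -> image direction
    return (loss_i2t + loss_t2i) / 2

# Toy usage with random embeddings standing in for encoder outputs.
print(info_nce(torch.randn(8, 256), torch.randn(8, 256)))
```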