A comprehensive survey on pretrained foundation models: A history from BERT to ChatGPT
Pretrained Foundation Models (PFMs) are regarded as the foundation for various
downstream tasks across different data modalities. A PFM (e.g., BERT, ChatGPT, GPT-4) is …
A comprehensive survey on applications of transformers for deep learning tasks
Transformers are Deep Neural Networks (DNNs) that utilize a self-attention
mechanism to capture contextual relationships within sequential data. Unlike traditional …
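For concreteness, the snippet below is a minimal NumPy sketch of the scaled dot-product self-attention the abstract refers to: each token's output is a context-weighted mixture of every token's value vector. The single-head setup, function name, and toy dimensions are illustrative assumptions, not code from the survey.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over one sequence (single head).

    x: (seq_len, d_model) token embeddings.
    w_q, w_k, w_v: (d_model, d_k) projection matrices.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])            # pairwise token affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over the sequence
    return weights @ v                                 # each token mixes context from all tokens

# Toy usage: 4 tokens, 8-dimensional embeddings.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
w_q, w_k, w_v = (rng.normal(size=(8, 8)) * 0.1 for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)          # (4, 8)
```

Because every pair of positions interacts directly in the `scores` matrix, long-range context is captured in a single step rather than propagated sequentially.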
Scaling up GANs for text-to-image synthesis
The recent success of text-to-image synthesis has taken the world by storm and captured the
general public's imagination. From a technical standpoint, it also marked a drastic change in …
ConvNeXt V2: Co-designing and scaling ConvNets with masked autoencoders
Driven by improved architectures and better representation learning frameworks, the field of
visual recognition has enjoyed rapid modernization and a performance boost in the early …
VideoMAE V2: Scaling video masked autoencoders with dual masking
Scale is the primary factor in building a powerful foundation model that can generalize well
to a variety of downstream tasks. However, it is still challenging to train video …
Scalable diffusion models with transformers
We explore a new class of diffusion models based on the transformer architecture. We train
latent diffusion models of images, replacing the commonly-used U-Net backbone with a …
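To illustrate what "replacing the U-Net backbone with a transformer" can look like in code, here is a rough PyTorch sketch of a ViT-style denoiser that patchifies a latent image, conditions on the timestep, and predicts noise of the same shape. The class name, layer sizes, and additive timestep conditioning are my own simplifications, not the paper's DiT design.

```python
import torch
import torch.nn as nn

class TransformerDenoiser(nn.Module):
    """Sketch: a transformer over latent patches standing in for a U-Net denoiser."""

    def __init__(self, c=4, patch=2, dim=256, depth=4, heads=4, n_tokens=256):
        super().__init__()
        self.patch = patch
        self.embed = nn.Linear(c * patch * patch, dim)            # patchify -> token
        self.pos = nn.Parameter(torch.zeros(1, n_tokens, dim))    # learned positions
        self.t_embed = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.out = nn.Linear(dim, c * patch * patch)              # token -> patch of noise

    def forward(self, x, t):
        b, c, h, w = x.shape
        p = self.patch
        tokens = x.unfold(2, p, p).unfold(3, p, p)                # (b, c, h/p, w/p, p, p)
        tokens = tokens.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * p * p)
        cond = self.t_embed(t[:, None]).unsqueeze(1)              # simple additive conditioning
        hidden = self.blocks(self.embed(tokens) + self.pos + cond)
        out = self.out(hidden).reshape(b, h // p, w // p, c, p, p)
        out = out.permute(0, 3, 1, 4, 2, 5)                       # un-patchify
        return out.reshape(b, c, h, w)                            # predicted noise

# Toy usage on a 32x32, 4-channel latent.
model = TransformerDenoiser()
x, t = torch.randn(2, 4, 32, 32), torch.rand(2)
print(model(x, t).shape)  # torch.Size([2, 4, 32, 32])
```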
Sigmoid loss for language image pre-training
We propose a simple pairwise sigmoid loss for image-text pre-training. Unlike standard
contrastive learning with softmax normalization, the sigmoid loss operates solely on image …
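The pairwise loss the abstract describes can be sketched as follows: every image-text pairing in a batch gets its own binary sigmoid term, with the matched pair as the positive and all other pairings as negatives, so no batch-wide softmax normalization is needed. The fixed temperature and bias constants below stand in for scalars the paper treats as learnable.

```python
import numpy as np

def sigmoid_pairwise_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    """Pairwise sigmoid image-text loss over a batch of matched pairs.

    img_emb, txt_emb: (n, d) L2-normalized embeddings; row i of each is a matched pair.
    """
    logits = t * img_emb @ txt_emb.T + b               # (n, n) pairwise similarities
    labels = 2.0 * np.eye(len(img_emb)) - 1.0          # +1 on the diagonal, -1 elsewhere
    log_sig = -np.logaddexp(0.0, -labels * logits)     # log sigmoid(label * logit), stable form
    return -log_sig.sum() / len(img_emb)               # average per image

# Toy usage with random unit vectors.
rng = np.random.default_rng(0)
unit = lambda x: x / np.linalg.norm(x, axis=-1, keepdims=True)
img, txt = unit(rng.normal(size=(4, 16))), unit(rng.normal(size=(4, 16)))
print(sigmoid_pairwise_loss(img, txt))
```

Because each term depends only on one image embedding and one text embedding, the loss decouples from the global batch structure, which is what makes scaling the batch size cheaper than with a softmax-normalized objective.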
Scaling data-constrained language models
The current trend of scaling language models involves increasing both parameter count and
training dataset size. Extrapolating this trend suggests that training dataset size may soon be …
EVA: Exploring the limits of masked visual representation learning at scale
We launch EVA, a vision-centric foundation model to explore the limits of visual
representation at scale using only publicly accessible data. EVA is a vanilla ViT pre-trained …
Scaling language-image pre-training via masking
We present Fast Language-Image Pre-training (FLIP), a simple and more efficient
method for training CLIP. Our method randomly masks out and removes a large portion of …
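The masking step described in the abstract can be sketched as follows: a large random subset of patch tokens is dropped before the image encoder, so only the kept patches are processed during training. The function name, keep ratio, and shapes are illustrative assumptions rather than FLIP's actual implementation.

```python
import numpy as np

def random_patch_mask(patches, keep_ratio=0.5, rng=None):
    """Randomly keep a fraction of image patches and drop the rest.

    patches: (n_patches, d) patch embeddings for one image.
    Returns the kept patches plus their sorted indices so positions can be recovered.
    """
    rng = rng or np.random.default_rng()
    n_keep = max(1, int(len(patches) * keep_ratio))
    keep_idx = np.sort(rng.permutation(len(patches))[:n_keep])
    return patches[keep_idx], keep_idx

# Toy usage: a 14x14 ViT patch grid, keeping half of the 196 patches.
rng = np.random.default_rng(0)
patches = rng.normal(size=(196, 768))
kept, idx = random_patch_mask(patches, keep_ratio=0.5, rng=rng)
print(kept.shape)  # (98, 768) -- the image encoder only sees these tokens
```

Since the encoder processes fewer tokens per image, each training step costs less compute, which is the efficiency the abstract refers to.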