A comprehensive survey on pretrained foundation models: A history from BERT to ChatGPT

C Zhou, Q Li, C Li, J Yu, Y Liu, G Wang… - International Journal of …, 2024 - Springer
Pretrained Foundation Models (PFMs) are regarded as the foundation for various
downstream tasks across different data modalities. A PFM (e.g., BERT, ChatGPT, GPT-4) is …

Transformers in vision: A survey

S Khan, M Naseer, M Hayat, SW Zamir… - ACM computing …, 2022 - dl.acm.org
Astounding results from Transformer models on natural language tasks have intrigued the
vision community to study their application to computer vision problems. Among their salient …

Multimodal foundation models: From specialists to general-purpose assistants

C Li, Z Gan, Z Yang, J Yang, L Li… - … and Trends® in …, 2024 - nowpublishers.com
This monograph presents a comprehensive survey of the taxonomy and evolution of
multimodal foundation models that demonstrate vision and vision-language capabilities …

Scaling up your kernels to 31x31: Revisiting large kernel design in CNNs

X Ding, X Zhang, J Han, G Ding - Proceedings of the IEEE …, 2022 - openaccess.thecvf.com
We revisit large kernel design in modern convolutional neural networks (CNNs). Inspired by
recent advances in vision transformers (ViTs), in this paper, we demonstrate that using a few …

Transformer-based unsupervised contrastive learning for histopathological image classification

X Wang, S Yang, J Zhang, M Wang, J Zhang… - Medical image …, 2022 - Elsevier
A large-scale and well-annotated dataset is a key factor for the success of deep learning in
medical image analysis. However, assembling such large annotations is very challenging …

Point-BERT: Pre-training 3D point cloud transformers with masked point modeling

X Yu, L Tang, Y Rao, T Huang… - Proceedings of the …, 2022 - openaccess.thecvf.com
We present Point-BERT, a novel paradigm for learning Transformers to generalize the
concept of BERT onto 3D point cloud. Following BERT, we devise a Masked Point Modeling …

Self-supervised pre-training of Swin Transformers for 3D medical image analysis

Y Tang, D Yang, W Li, HR Roth… - Proceedings of the …, 2022 - openaccess.thecvf.com
Vision Transformers (ViTs) have shown great performance in self-supervised
learning of global and local representations that can be transferred to downstream …

DenseCLIP: Language-guided dense prediction with context-aware prompting

Y Rao, W Zhao, G Chen, Y Tang… - Proceedings of the …, 2022 - openaccess.thecvf.com
Recent progress has shown that large-scale pre-training using contrastive image-text pairs
can be a promising alternative for high-quality visual representation learning from natural …

iBOT: Image BERT pre-training with online tokenizer

J Zhou, C Wei, H Wang, W Shen, C Xie, A Yuille… - arxiv preprint arxiv …, 2021 - arxiv.org
The success of language Transformers is primarily attributed to the pretext task of masked
language modeling (MLM), where texts are first tokenized into semantically meaningful …

BEiT: BERT pre-training of image transformers

H Bao, L Dong, S Piao, F Wei - arxiv preprint arxiv:2106.08254, 2021 - arxiv.org
We introduce a self-supervised vision representation model BEiT, which stands for
Bidirectional Encoder representation from Image Transformers. Following BERT developed …