A comprehensive survey on pretrained foundation models: A history from BERT to ChatGPT
Pretrained Foundation Models (PFMs) are regarded as the foundation for various
downstream tasks across different data modalities. A PFM (e.g., BERT, ChatGPT, GPT-4) is …
Attention mechanisms in computer vision: A survey
Humans can naturally and effectively find salient regions in complex scenes. Motivated by
this observation, attention mechanisms were introduced into computer vision with the aim of …
Vision Mamba: Efficient visual representation learning with bidirectional state space model
Recently the state space models (SSMs) with efficient hardware-aware designs, i.e., the
Mamba deep learning model, have shown great potential for long sequence modeling …
MaxViT: Multi-axis vision transformer
Transformers have recently gained significant attention in the computer vision community.
However, the lack of scalability of self-attention mechanisms with respect to image size has …
VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training
Pre-training video transformers on extra large-scale datasets is generally required to
achieve premier performance on relatively small datasets. In this paper, we show that video …
EfficientFormer: Vision transformers at MobileNet speed
Vision Transformers (ViT) have shown rapid progress in computer vision tasks,
achieving promising results on various benchmarks. However, due to the massive number of …
PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis
The most advanced text-to-image (T2I) models require significant training costs (e.g., millions
of GPU hours), seriously hindering the fundamental innovation for the AIGC community …
Scaling & shifting your features: A new baseline for efficient model tuning
Existing fine-tuning methods either tune all parameters of the pre-trained model (full fine-
tuning), which is not efficient, or only tune the last linear layer (linear probing), which suffers …
Deep model reassembly
In this paper, we explore a novel knowledge-transfer task, termed as Deep Model
Reassembly (DeRy), for general-purpose model reuse. Given a collection of heterogeneous …
Vision transformer with deformable attention
Transformers have recently shown superior performances on various vision tasks. The large,
sometimes even global, receptive field endows Transformer models with higher …