EfficientFormer: Vision transformers at MobileNet speed

Y Li, G Yuan, Y Wen, J Hu… - Advances in …, 2022 - proceedings.neurips.cc
Abstract Vision Transformers (ViT) have shown rapid progress in computer vision tasks,
achieving promising results on various benchmarks. However, due to the massive number of …

Scaling & shifting your features: A new baseline for efficient model tuning

D Lian, D Zhou, J Feng, X Wang - Advances in Neural …, 2022 - proceedings.neurips.cc
Existing fine-tuning methods either tune all parameters of the pre-trained model (full fine-
tuning), which is not efficient, or only tune the last linear layer (linear probing), which suffers …

DeiT III: Revenge of the ViT

H Touvron, M Cord, H Jégou - European conference on computer vision, 2022 - Springer
Abstract A Vision Transformer (ViT) is a simple neural architecture amenable to serve
several computer vision tasks. It has limited built-in architectural priors, in contrast to more …

Surgical fine-tuning improves adaptation to distribution shifts

Y Lee, AS Chen, F Tajwar, A Kumar, H Yao… - arXiv preprint arXiv …, 2022 - arxiv.org
A common approach to transfer learning under distribution shift is to fine-tune the last few
layers of a pre-trained model, preserving learned features while also adapting to the new …

ONE-PEACE: Exploring one general representation model toward unlimited modalities

P Wang, S Wang, J Lin, S Bai, X Zhou, J Zhou… - arXiv preprint arXiv …, 2023 - arxiv.org
In this work, we explore a scalable way for building a general representation model toward
unlimited modalities. We release ONE-PEACE, a highly extensible model with 4B …

Masked world models for visual control

Y Seo, D Hafner, H Liu, F Liu, S James… - … on Robot Learning, 2023 - proceedings.mlr.press
Visual model-based reinforcement learning (RL) has the potential to enable sample-efficient
robot learning from visual observations. Yet the current approaches typically train a single …

No representation rules them all in category discovery

S Vaze, A Vedaldi, A Zisserman - Advances in Neural …, 2024 - proceedings.neurips.cc
In this paper we tackle the problem of Generalized Category Discovery (GCD). Specifically,
given a dataset with labelled and unlabelled images, the task is to cluster all images in the …

ConvMAE: Masked convolution meets masked autoencoders

P Gao, T Ma, H Li, Z Lin, J Dai, Y Qiao - arXiv preprint arXiv:2205.03892, 2022 - arxiv.org
Vision Transformers (ViT) become widely-adopted architectures for various vision tasks.
Masked auto-encoding for feature pretraining and multi-scale hybrid convolution-transformer …

RangeViT: Towards vision transformers for 3D semantic segmentation in autonomous driving

A Ando, S Gidaris, A Bursuc, G Puy… - Proceedings of the …, 2023 - openaccess.thecvf.com
Casting semantic segmentation of outdoor LiDAR point clouds as a 2D problem, e.g., via
range projection, is an effective and popular approach. These projection-based methods …

CAT-Seg: Cost aggregation for open-vocabulary semantic segmentation

S Cho, H Shin, S Hong, A Arnab… - Proceedings of the …, 2024 - openaccess.thecvf.com
Open-vocabulary semantic segmentation presents the challenge of labeling each pixel
within an image based on a wide range of text descriptions. In this work we introduce a …