DINOv2: Learning robust visual features without supervision

M Oquab, T Darcet, T Moutakanni, H Vo… - arXiv preprint arXiv …, 2023 - arxiv.org
The recent breakthroughs in natural language processing for model pretraining on large
quantities of data have opened the way for similar foundation models in computer vision …

Scaling vision transformers to 22 billion parameters

M Dehghani, J Djolonga, B Mustafa… - International …, 2023 - proceedings.mlr.press
The scaling of Transformers has driven breakthrough capabilities for language models. At
present, the largest large language models (LLMs) contain upwards of 100B parameters …

PaliGemma: A versatile 3B VLM for transfer

L Beyer, A Steiner, AS Pinto, A Kolesnikov… - arXiv preprint arXiv …, 2024 - arxiv.org
PaliGemma is an open Vision-Language Model (VLM) that is based on the SigLIP-So400m
vision encoder and the Gemma-2B language model. It is trained to be a versatile and …

Patch n' Pack: NaViT, a vision transformer for any aspect ratio and resolution

M Dehghani, B Mustafa, J Djolonga… - Advances in …, 2023 - proceedings.neurips.cc
The ubiquitous and demonstrably suboptimal choice of resizing images to a fixed resolution
before processing them with computer vision models has not yet been successfully …
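
The "packing" named in the title is simple to sketch: patch tokens from several variable-resolution images are concatenated into one fixed-length sequence, and an attention mask keeps tokens from attending across image boundaries. Below is a minimal NumPy sketch assuming a greedy packing policy; the helper names (patchify, pack_images) and the shapes are illustrative, not NaViT's actual implementation.

import numpy as np

def patchify(img, p):
    """Split an (H, W, C) image into flattened, non-overlapping p x p patches."""
    H, W, C = img.shape
    h, w = H // p, W // p
    patches = img[:h * p, :w * p].reshape(h, p, w, p, C)
    return patches.transpose(0, 2, 1, 3, 4).reshape(h * w, p * p * C)

def pack_images(images, p, seq_len):
    """Greedily pack patch tokens from variable-size images into one sequence."""
    dim = p * p * images[0].shape[-1]
    tokens = np.zeros((seq_len, dim))
    owner = np.full(seq_len, -1)            # -1 marks padding slots
    cursor = 0
    for i, img in enumerate(images):
        t = patchify(img, p)
        if cursor + len(t) > seq_len:       # a real packer would open a new sequence
            break
        tokens[cursor:cursor + len(t)] = t
        owner[cursor:cursor + len(t)] = i
        cursor += len(t)
    # allow attention only within the same image, never to or from padding
    attn_mask = (owner[:, None] == owner[None, :]) & (owner[:, None] >= 0)
    return tokens, attn_mask

# e.g. two images with different aspect ratios packed into one 64-token sequence
imgs = [np.random.rand(32, 48, 3), np.random.rand(16, 16, 3)]
tokens, mask = pack_images(imgs, p=8, seq_len=64)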

RangeViT: Towards vision transformers for 3D semantic segmentation in autonomous driving

A Ando, S Gidaris, A Bursuc, G Puy… - Proceedings of the …, 2023 - openaccess.thecvf.com
Casting semantic segmentation of outdoor LiDAR point clouds as a 2D problem, e.g., via
range projection, is an effective and popular approach. These projection-based methods …
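
The "range projection" mentioned here is a standard spherical mapping of LiDAR points onto a 2D image, which is what lets 2D vision transformers consume point clouds at all. A rough NumPy sketch, assuming nonzero ranges; the function name and the field-of-view defaults (typical of a 64-beam sensor) are illustrative, not RangeViT's exact preprocessing.

import numpy as np

def range_projection(points, H=64, W=1024, fov_up=3.0, fov_down=-25.0):
    """Project LiDAR points (N, 3) to an H x W range image via spherical coordinates."""
    x, y, z = points.T
    r = np.linalg.norm(points, axis=1)       # range per point (assumed nonzero)
    yaw = np.arctan2(y, x)                   # azimuth in [-pi, pi]
    pitch = np.arcsin(z / r)                 # elevation angle
    fu, fd = np.radians(fov_up), np.radians(fov_down)
    row = ((1 - (pitch - fd) / (fu - fd)) * (H - 1)).clip(0, H - 1).astype(int)
    col = ((0.5 * (1 - yaw / np.pi)) * (W - 1)).clip(0, W - 1).astype(int)
    img = np.zeros((H, W))
    img[row, col] = r                        # later points overwrite earlier ones
    return img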

PlainMamba: Improving non-hierarchical Mamba in visual recognition

C Yang, Z Chen, M Espinosa, L Ericsson… - arXiv preprint arXiv …, 2024 - arxiv.org
We present PlainMamba: a simple non-hierarchical state space model (SSM) designed for
general visual recognition. The recent Mamba model has shown how SSMs can be highly …
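
For context on the backbone: a state space model is, at its core, the linear recurrence h_t = A h_{t-1} + B x_t, y_t = C h_t swept over the token sequence. A toy diagonal-SSM scan in NumPy follows; the shapes and names are illustrative, and the real Mamba block adds input-dependent (selective) parameters, gating, and a hardware-aware parallel scan.

import numpy as np

def ssm_scan(x, A, B, C):
    """Sweep a diagonal linear state space model over a sequence.
    x: (T, d_in) inputs; A: (d_state,) per-channel decay;
    B: (d_state, d_in) input map; C: (d_out, d_state) readout."""
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:                  # h_t = A * h_{t-1} + B x_t ;  y_t = C h_t
        h = A * h + B @ x_t
        ys.append(C @ h)
    return np.stack(ys)

rng = np.random.default_rng(0)
y = ssm_scan(rng.normal(size=(16, 8)),         # 16 tokens, 8 channels each
             A=np.full(32, 0.9),               # stable decay for 32 state channels
             B=rng.normal(size=(32, 8)),
             C=rng.normal(size=(8, 32)))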

Getting ViT in shape: Scaling laws for compute-optimal model design

IM Alabdulmohsin, X Zhai… - Advances in Neural …, 2023 - proceedings.neurips.cc
Scaling laws have been recently employed to derive compute-optimal model size (number
of parameters) for a given compute duration. We advance and refine such methods to infer …
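
Analyses of this kind usually start by fitting a saturating power law L(N) = a N^(-alpha) + c to (parameter count, loss) measurements and reading off the compute-optimal size. A hedged sketch on synthetic data; the constants and the functional form are illustrative placeholders, not the paper's fitted values.

import numpy as np
from scipy.optimize import curve_fit

# Synthetic (parameter count, loss) pairs; a real study would measure these.
N = np.array([5e6, 2e7, 1e8, 5e8, 2e9])
L = 0.4 + 12.0 * N ** -0.31 + np.random.default_rng(0).normal(0, 0.003, 5)

def power_law(n, a, alpha, c):
    # saturating power law: loss decays with size toward an irreducible floor c
    return a * n ** -alpha + c

(a, alpha, c), _ = curve_fit(power_law, N, L, p0=(10.0, 0.3, 0.5))
print(f"fit: L(N) = {a:.2f} * N^-{alpha:.3f} + {c:.3f}")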

Rotary position embedding for vision transformer

B Heo, S Park, D Han, S Yun - European Conference on Computer Vision, 2024 - Springer
Rotary Position Embedding (RoPE) performs remarkably on language models,
especially for length extrapolation of Transformers. However, the impacts of RoPE on …
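
The mechanism itself is compact: RoPE rotates each pair of channels by an angle proportional to the token's position, and the usual 2D extension for ViTs applies this axially, rotating half the channels by the patch's row index and half by its column index. A minimal NumPy sketch; the axial split shown is one common variant among those the paper studies.

import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Rotate channel pairs of x (n, d) by angles pos * theta_i (standard 1D RoPE)."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)       # (d/2,) frequencies
    ang = pos[:, None] * theta[None, :]             # (n, d/2) rotation angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_2d(x, h, w):
    """Axial 2D RoPE over an h x w patch grid (channel dim divisible by 4)."""
    ys, xs = np.divmod(np.arange(h * w), w)         # row and column per patch
    d = x.shape[-1] // 2
    return np.concatenate([rope_1d(x[..., :d], ys), rope_1d(x[..., d:], xs)], axis=-1)

q = rope_2d(np.random.rand(14 * 14, 64), 14, 14)    # e.g. queries for a 14x14 grid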

Which tokens to use? Investigating token reduction in vision transformers

JB Haurum, S Escalera, GW Taylor… - Proceedings of the …, 2023 - openaccess.thecvf.com
Since the introduction of the Vision Transformer (ViT), researchers have sought to make ViTs
more efficient by removing redundant information in the processed tokens. While different …
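
One of the simplest schemes in this design space is top-k pruning, where patch tokens are scored (for example by the attention they receive from the [CLS] token) and only the best-scoring ones are kept. A toy NumPy sketch; the function name is illustrative, and this is just one of the reduction criteria such comparisons cover.

import numpy as np

def prune_tokens(tokens, cls_attn, keep):
    """Keep the `keep` patch tokens that [CLS] attends to most strongly.
    tokens: (n, d) patch tokens; cls_attn: (n,) attention weights from [CLS]."""
    idx = np.argsort(cls_attn)[-keep:]    # indices of the top-`keep` scores
    idx.sort()                            # restore original spatial order
    return tokens[idx], idx

rng = np.random.default_rng(0)
kept, idx = prune_tokens(rng.normal(size=(196, 768)), rng.random(196), keep=98)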

Oryx MLLM: On-demand spatial-temporal understanding at arbitrary resolution

Z Liu, Y Dong, Z Liu, W Hu, J Lu, Y Rao - arXiv preprint arXiv:2409.12961, 2024 - arxiv.org
Visual data comes in various forms, ranging from small icons of just a few pixels to long
videos spanning hours. Existing multi-modal LLMs usually standardize these diverse visual …