DINOv2: Learning robust visual features without supervision

M Oquab, T Darcet, T Moutakanni, H Vo… - arXiv preprint arXiv …, 2023 - arxiv.org
The recent breakthroughs in natural language processing for model pretraining on large
quantities of data have opened the way for similar foundation models in computer vision …

Scaling vision transformers to 22 billion parameters

M Dehghani, J Djolonga, B Mustafa… - International …, 2023 - proceedings.mlr.press
The scaling of Transformers has driven breakthrough capabilities for language models. At
present, the largest large language models (LLMs) contain upwards of 100B parameters …

PaliGemma: A versatile 3B VLM for transfer

L Beyer, A Steiner, AS Pinto, A Kolesnikov… - arXiv preprint arXiv …, 2024 - arxiv.org
PaliGemma is an open Vision-Language Model (VLM) that is based on the SigLIP-So400m
vision encoder and the Gemma-2B language model. It is trained to be a versatile and …

Patch n' Pack: NaViT, a vision transformer for any aspect ratio and resolution

M Dehghani, B Mustafa, J Djolonga… - Advances in …, 2023 - proceedings.neurips.cc
The ubiquitous and demonstrably suboptimal choice of resizing images to a fixed resolution
before processing them with computer vision models has not yet been successfully …
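
The "packing" named in the title is simple to sketch: patch tokens from several variable-resolution images are concatenated into one fixed-length sequence, and an attention mask keeps tokens from attending across image boundaries. Below is a minimal NumPy sketch assuming a greedy packing policy; the helper names (patchify, pack_images) and the shapes are illustrative, not NaViT's actual implementation.

import numpy as np

def patchify(img, p):
    """Split an (H, W, C) image into flattened, non-overlapping p x p patches."""
    H, W, C = img.shape
    h, w = H // p, W // p
    patches = img[:h * p, :w * p].reshape(h, p, w, p, C)
    return patches.transpose(0, 2, 1, 3, 4).reshape(h * w, p * p * C)

def pack_images(images, p, seq_len):
    """Greedily pack patch tokens from variable-size images into one sequence."""
    dim = p * p * images[0].shape[-1]
    tokens = np.zeros((seq_len, dim))
    owner = np.full(seq_len, -1)            # -1 marks padding slots
    cursor = 0
    for i, img in enumerate(images):
        t = patchify(img, p)
        if cursor + len(t) > seq_len:       # a real packer would open a new sequence
            break
        tokens[cursor:cursor + len(t)] = t
        owner[cursor:cursor + len(t)] = i
        cursor += len(t)
    # allow attention only within the same image, never to or from padding
    attn_mask = (owner[:, None] == owner[None, :]) & (owner[:, None] >= 0)
    return tokens, attn_mask

# e.g. two images with different aspect ratios packed into one 64-token sequence
imgs = [np.random.rand(32, 48, 3), np.random.rand(16, 16, 3)]
tokens, mask = pack_images(imgs, p=8, seq_len=64)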

RangeViT: Towards vision transformers for 3D semantic segmentation in autonomous driving

A Ando, S Gidaris, A Bursuc, G Puy… - Proceedings of the …, 2023 - openaccess.thecvf.com
Casting semantic segmentation of outdoor LiDAR point clouds as a 2D problem, e.g., via
range projection, is an effective and popular approach. These projection-based methods …
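
The "range projection" mentioned here is a standard spherical mapping of LiDAR points onto a 2D image, which is what lets 2D vision transformers consume point clouds at all. A rough NumPy sketch, assuming nonzero ranges; the function name and the field-of-view defaults (typical of a 64-beam sensor) are illustrative, not RangeViT's exact preprocessing.

import numpy as np

def range_projection(points, H=64, W=1024, fov_up=3.0, fov_down=-25.0):
    """Project LiDAR points (N, 3) to an H x W range image via spherical coordinates."""
    x, y, z = points.T
    r = np.linalg.norm(points, axis=1)       # range per point (assumed nonzero)
    yaw = np.arctan2(y, x)                   # azimuth in [-pi, pi]
    pitch = np.arcsin(z / r)                 # elevation angle
    fu, fd = np.radians(fov_up), np.radians(fov_down)
    row = ((1 - (pitch - fd) / (fu - fd)) * (H - 1)).clip(0, H - 1).astype(int)
    col = ((0.5 * (1 - yaw / np.pi)) * (W - 1)).clip(0, W - 1).astype(int)
    img = np.zeros((H, W))
    img[row, col] = r                        # later points overwrite earlier ones
    return img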

PlainMamba: Improving non-hierarchical Mamba in visual recognition

C Yang, Z Chen, M Espinosa, L Ericsson… - arXiv preprint arXiv …, 2024 - arxiv.org
We present PlainMamba: a simple non-hierarchical state space model (SSM) designed for
general visual recognition. The recent Mamba model has shown how SSMs can be highly …
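
For context on the backbone: a state space model is, at its core, the linear recurrence h_t = A h_{t-1} + B x_t, y_t = C h_t swept over the token sequence. A toy diagonal-SSM scan in NumPy follows; the shapes and names are illustrative, and the real Mamba block adds input-dependent (selective) parameters, gating, and a hardware-aware parallel scan.

import numpy as np

def ssm_scan(x, A, B, C):
    """Sweep a diagonal linear state space model over a sequence.
    x: (T, d_in) inputs; A: (d_state,) per-channel decay;
    B: (d_state, d_in) input map; C: (d_out, d_state) readout."""
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:                  # h_t = A * h_{t-1} + B x_t ;  y_t = C h_t
        h = A * h + B @ x_t
        ys.append(C @ h)
    return np.stack(ys)

rng = np.random.default_rng(0)
y = ssm_scan(rng.normal(size=(16, 8)),         # 16 tokens, 8 channels each
             A=np.full(32, 0.9),               # stable decay for 32 state channels
             B=rng.normal(size=(32, 8)),
             C=rng.normal(size=(8, 32)))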

Getting ViT in shape: Scaling laws for compute-optimal model design

IM Alabdulmohsin, X Zhai… - Advances in Neural …, 2023 - proceedings.neurips.cc
Scaling laws have been recently employed to derive compute-optimal model size (number
of parameters) for a given compute duration. We advance and refine such methods to infer …
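
Analyses of this kind usually start by fitting a saturating power law L(N) = a N^(-alpha) + c to (parameter count, loss) measurements and reading off the compute-optimal size. A hedged sketch on synthetic data; the constants and the functional form are illustrative placeholders, not the paper's fitted values.

import numpy as np
from scipy.optimize import curve_fit

# Synthetic (parameter count, loss) pairs; a real study would measure these.
N = np.array([5e6, 2e7, 1e8, 5e8, 2e9])
L = 0.4 + 12.0 * N ** -0.31 + np.random.default_rng(0).normal(0, 0.003, 5)

def power_law(n, a, alpha, c):
    # saturating power law: loss decays with size toward an irreducible floor c
    return a * n ** -alpha + c

(a, alpha, c), _ = curve_fit(power_law, N, L, p0=(10.0, 0.3, 0.5))
print(f"fit: L(N) = {a:.2f} * N^-{alpha:.3f} + {c:.3f}")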

Rotary position embedding for vision transformer

B Heo, S Park, D Han, S Yun - European Conference on Computer Vision, 2024 - Springer
Rotary Position Embedding (RoPE) performs remarkably on language models,
especially for length extrapolation of Transformers. However, the impacts of RoPE on …
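
The mechanism itself is compact: RoPE rotates each pair of channels by an angle proportional to the token's position, and the usual 2D extension for ViTs applies this axially, rotating half the channels by the patch's row index and half by its column index. A minimal NumPy sketch; the axial split shown is one common variant among those the paper studies.

import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Rotate channel pairs of x (n, d) by angles pos * theta_i (standard 1D RoPE)."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)       # (d/2,) frequencies
    ang = pos[:, None] * theta[None, :]             # (n, d/2) rotation angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_2d(x, h, w):
    """Axial 2D RoPE over an h x w patch grid (channel dim divisible by 4)."""
    ys, xs = np.divmod(np.arange(h * w), w)         # row and column per patch
    d = x.shape[-1] // 2
    return np.concatenate([rope_1d(x[..., :d], ys), rope_1d(x[..., d:], xs)], axis=-1)

q = rope_2d(np.random.rand(14 * 14, 64), 14, 14)    # e.g. queries for a 14x14 grid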

Which tokens to use? Investigating token reduction in vision transformers

JB Haurum, S Escalera, GW Taylor… - Proceedings of the …, 2023 - openaccess.thecvf.com
Since the introduction of the Vision Transformer (ViT), researchers have sought to make ViTs
more efficient by removing redundant information in the processed tokens. While different …
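
One of the simplest schemes in this design space is top-k pruning, where patch tokens are scored (for example by the attention they receive from the [CLS] token) and only the best-scoring ones are kept. A toy NumPy sketch; the function name is illustrative, and this is just one of the reduction criteria such comparisons cover.

import numpy as np

def prune_tokens(tokens, cls_attn, keep):
    """Keep the `keep` patch tokens that [CLS] attends to most strongly.
    tokens: (n, d) patch tokens; cls_attn: (n,) attention weights from [CLS]."""
    idx = np.argsort(cls_attn)[-keep:]    # indices of the top-`keep` scores
    idx.sort()                            # restore original spatial order
    return tokens[idx], idx

rng = np.random.default_rng(0)
kept, idx = prune_tokens(rng.normal(size=(196, 768)), rng.random(196), keep=98)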

Oryx MLLM: On-demand spatial-temporal understanding at arbitrary resolution

Z Liu, Y Dong, Z Liu, W Hu, J Lu, Y Rao - arXiv preprint arXiv:2409.12961, 2024 - arxiv.org
Visual data comes in various forms, ranging from small icons of just a few pixels to long
videos spanning hours. Existing multi-modal LLMs usually standardize these diverse visual …