Transformer for object re-identification: A survey

M Ye, S Chen, C Li, WS Zheng, D Crandall… - International Journal of …, 2024 - Springer
Abstract Object Re-identification (Re-ID) aims to identify specific objects across different
times and scenes, which is a widely researched task in computer vision. For a prolonged …

Probing the 3D awareness of visual foundation models

M El Banani, A Raj, KK Maninis, A Kar… - Proceedings of the …, 2024 - openaccess.thecvf.com
Recent advances in large-scale pretraining have yielded visual foundation models with
strong capabilities. Not only can recent models generalize to arbitrary images for their …

Which tokens to use? Investigating token reduction in vision transformers

JB Haurum, S Escalera, GW Taylor… - Proceedings of the …, 2023 - openaccess.thecvf.com
Since the introduction of the Vision Transformer (ViT), researchers have sought to make ViTs
more efficient by removing redundant information in the processed tokens. While different …

Diffusion models beat GANs on image classification

S Mukhopadhyay, M Gwilliam, V Agarwal… - arXiv preprint arXiv …, 2023 - arxiv.org
While many unsupervised learning models focus on one family of tasks, either generative or
discriminative, we explore the possibility of a unified representation learner: a model which …

CLIP-DINOiser: Teaching CLIP a Few DINO Tricks for Open-Vocabulary Semantic Segmentation

M Wysoczańska, O Siméoni, M Ramamonjisoa… - … on Computer Vision, 2024 - Springer
The popular CLIP model displays impressive zero-shot capabilities thanks to its seamless
interaction with arbitrary text prompts. However, its lack of spatial awareness makes it …

ConvNet vs Transformer, supervised vs CLIP: Beyond ImageNet accuracy

K Vishniakov, Z Shen, Z Liu - arXiv preprint arXiv:2311.09215, 2023 - arxiv.org
Modern computer vision offers a great variety of models to practitioners, and selecting a
model from multiple options for specific applications can be challenging. Conventionally …

Improving semantic correspondence with viewpoint-guided spherical maps

O Mariotti, O Mac Aodha… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Recent self-supervised models produce visual features that are not only effective at
encoding image-level but also pixel-level semantics. They have been reported to obtain …

Understanding Video Transformers via Universal Concept Discovery

M Kowal, A Dave, R Ambrus… - Proceedings of the …, 2024 - openaccess.thecvf.com
This paper studies the problem of concept-based interpretability of transformer
representations for videos. Concretely we seek to explain the decision-making process of …

LiFT: A surprisingly simple lightweight feature transform for dense ViT descriptors

S Suri, M Walmer, K Gupta, A Shrivastava - European Conference on …, 2024 - Springer
We present a simple self-supervised method to enhance the performance of ViT features for
dense downstream tasks. Our Lightweight Feature Transform (LiFT) is a straightforward and …

MIM-Refiner: A contrastive learning boost from intermediate pre-trained representations

B Alkin, L Miklautz, S Hochreiter… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce MIM (Masked Image Modeling)-Refiner, a contrastive learning boost for pre-trained MIM models. The motivation behind MIM-Refiner is rooted in the insight that optimal …