Transformer for object re-identification: A survey
Abstract Object Re-identification (Re-ID) aims to identify specific objects across different
times and scenes, which is a widely researched task in computer vision. For a prolonged …
Probing the 3D awareness of visual foundation models
Recent advances in large-scale pretraining have yielded visual foundation models with
strong capabilities. Not only can recent models generalize to arbitrary images for their …
Which tokens to use? Investigating token reduction in vision transformers
Since the introduction of the Vision Transformer (ViT), researchers have sought to make ViTs
more efficient by removing redundant information in the processed tokens. While different …
Diffusion models beat GANs on image classification
While many unsupervised learning models focus on one family of tasks, either generative or
discriminative, we explore the possibility of a unified representation learner: a model which …
CLIP-DINOiser: Teaching CLIP a Few DINO Tricks for Open-Vocabulary Semantic Segmentation
The popular CLIP model displays impressive zero-shot capabilities thanks to its seamless
interaction with arbitrary text prompts. However, its lack of spatial awareness makes it …
ConvNet vs Transformer, supervised vs CLIP: Beyond ImageNet accuracy
Modern computer vision offers a great variety of models to practitioners, and selecting a
model from multiple options for specific applications can be challenging. Conventionally …
Improving semantic correspondence with viewpoint-guided spherical maps
Recent self-supervised models produce visual features that are not only effective at
encoding image-level but also pixel-level semantics. They have been reported to obtain …
Understanding Video Transformers via Universal Concept Discovery
This paper studies the problem of concept-based interpretability of transformer
representations for videos. Concretely we seek to explain the decision-making process of …
LiFT: A surprisingly simple lightweight feature transform for dense ViT descriptors
We present a simple self-supervised method to enhance the performance of ViT features for
dense downstream tasks. Our Lightweight Feature Transform (LiFT) is a straightforward and …
MIM-Refiner: A contrastive learning boost from intermediate pre-trained representations
We introduce MIM-Refiner (Masked Image Modeling Refiner), a contrastive learning boost for pre-trained MIM models. The motivation behind MIM-Refiner is rooted in the insight that optimal …