ZegCLIP: Towards adapting CLIP for zero-shot semantic segmentation

Z Zhou, Y Lei, B Zhang, L Liu… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Recently, CLIP has been applied to pixel-level zero-shot learning tasks via a two-stage
scheme. The general idea is to first generate class-agnostic region proposals and then feed …
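The two-stage scheme this snippet refers to can be illustrated in a few lines. Below is a minimal sketch, assuming a class-agnostic proposal generator and CLIP-style image/text encoders are available as callables; it shows the baseline pipeline the abstract describes, not ZegCLIP's own one-stage design.

import torch

def two_stage_zero_shot_segmentation(image, proposals, clip_image_encoder,
                                      clip_text_encoder, class_prompts):
    """Classify class-agnostic region proposals with CLIP.

    image:          (3, H, W) tensor
    proposals:      list of (x0, y0, x1, y1) boxes from a class-agnostic model
    class_prompts:  list of text prompts, one per candidate class
    """
    # Embed the candidate class names once.
    text_feats = clip_text_encoder(class_prompts)                # (C, D)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)

    labels = []
    for (x0, y0, x1, y1) in proposals:
        crop = image[:, y0:y1, x0:x1].unsqueeze(0)               # (1, 3, h, w)
        img_feat = clip_image_encoder(crop)                      # (1, D)
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        sims = img_feat @ text_feats.T                           # (1, C)
        labels.append(sims.argmax(dim=-1).item())
    return labels  # one class index per proposal; paste back into a mask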

What does CLIP know about a red circle? Visual prompt engineering for VLMs

A Shtedritski, C Rupprecht… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Abstract Large-scale Vision-Language Models, such as CLIP, learn powerful image-text
representations that have found numerous applications, from zero-shot classification to text …
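As a concrete illustration of the visual prompt engineering the title points to, the sketch below draws a red circle around a region and scores the marked image against a text query using the open-source clip package; the box coordinates and prompt wording are placeholder assumptions.

import torch
import clip
from PIL import Image, ImageDraw

def red_circle_score(image_path, box, text, device="cpu"):
    model, preprocess = clip.load("ViT-B/32", device=device)
    img = Image.open(image_path).convert("RGB")

    # Draw the visual prompt: a red ellipse around the region of interest.
    draw = ImageDraw.Draw(img)
    draw.ellipse(box, outline=(255, 0, 0), width=4)

    image_input = preprocess(img).unsqueeze(0).to(device)
    text_input = clip.tokenize([text]).to(device)
    with torch.no_grad():
        logits_per_image, _ = model(image_input, text_input)
    return logits_per_image.item()

# Example: score = red_circle_score("scene.jpg", (120, 80, 240, 200), "a dog")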

Neural feature fusion fields: 3d distillation of self-supervised 2d image representations

V Tschernezki, I Laina, D Larlus… - … Conference on 3D …, 2022 - ieeexplore.ieee.org
We present Neural Feature Fusion Fields (N3F), a method that improves dense 2D image
feature extractors when the latter are applied to the analysis of multiple images …
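A rough sketch of the distillation idea the snippet describes, under simplifying assumptions: a coordinate MLP stands in for the feature field and a single 3D point per pixel replaces volume rendering. This is not the N3F implementation, only the fitting objective in miniature.

import torch
import torch.nn as nn

class FeatureField(nn.Module):
    def __init__(self, feat_dim=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, feat_dim),
        )

    def forward(self, xyz):          # xyz: (N, 3) 3D points
        return self.mlp(xyz)         # (N, feat_dim)

def distillation_loss(field, surface_points, target_2d_features):
    """surface_points:     (N, 3) points hit by rays through sampled pixels
       target_2d_features: (N, feat_dim) 2D extractor features at those pixels"""
    pred = field(surface_points)
    return ((pred - target_2d_features) ** 2).mean()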

Bridging the gap to real-world object-centric learning

M Seitzer, M Horn, A Zadaianchuk, D Zietlow… - arXiv preprint arXiv …, 2022 - arxiv.org
Humans naturally decompose their environment into entities at the appropriate level of
abstraction to act in the world. Allowing machine learning algorithms to derive this …

Open vocabulary semantic segmentation with patch aligned contrastive learning

J Mukhoti, TY Lin, O Poursaeed… - Proceedings of the …, 2023 - openaccess.thecvf.com
Abstract We introduce Patch Aligned Contrastive Learning (PACL), a modified compatibility
function for CLIP's contrastive loss, intending to train an alignment between the patch tokens …
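The snippet names a modified compatibility function between patch tokens and text. A hedged sketch of such a patch-aligned compatibility follows: the image-text score is computed from patch tokens weighted by their similarity to the text embedding rather than from the global image token alone. Shapes and the temperature value are illustrative assumptions, not PACL's exact formulation.

import torch
import torch.nn.functional as F

def patch_aligned_compatibility(patch_tokens, text_emb, temperature=0.07):
    """patch_tokens: (B, N, D) projected patch embeddings
       text_emb:     (B, D)    projected text embeddings"""
    patch_tokens = F.normalize(patch_tokens, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Patch-to-text similarity defines an attention over patches.
    sim = torch.einsum("bnd,bd->bn", patch_tokens, text_emb)     # (B, N)
    weights = sim.softmax(dim=-1)

    # Text-conditioned pooled image embedding.
    pooled = torch.einsum("bn,bnd->bd", weights, patch_tokens)   # (B, D)
    pooled = F.normalize(pooled, dim=-1)

    # Compatibility score used inside the contrastive loss.
    return (pooled * text_emb).sum(dim=-1) / temperature         # (B,)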

MasQCLIP for open-vocabulary universal image segmentation

X Xu, T **ong, Z Ding, Z Tu - Proceedings of the IEEE/CVF …, 2023 - openaccess.thecvf.com
We present a new method for open-vocabulary universal image segmentation, which is
capable of performing instance, semantic, and panoptic segmentation under a unified …

Probing the 3d awareness of visual foundation models

M El Banani, A Raj, KK Maninis, A Kar… - Proceedings of the …, 2024 - openaccess.thecvf.com
Recent advances in large-scale pretraining have yielded visual foundation models with
strong capabilities. Not only can recent models generalize to arbitrary images for their …

ReKep: Spatio-temporal reasoning of relational keypoint constraints for robotic manipulation

W Huang, C Wang, Y Li, R Zhang, L Fei-Fei - arXiv preprint arXiv …, 2024 - arxiv.org
Representing robotic manipulation tasks as constraints that associate the robot and the
environment is a promising way to encode desired robot behaviors. However, it remains …
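To make the notion of a constraint over keypoints concrete, here is a toy sketch: each constraint maps tracked 3D keypoints to a scalar cost, and a state is feasible when every cost falls below a tolerance. The keypoint names and the grasp constraint are invented examples, not ReKep's own constraints.

import numpy as np

def grasp_constraint(keypoints):
    """keypoints: dict of name -> (3,) array of 3D positions."""
    # Cost is the distance between the gripper tip and the object handle.
    return np.linalg.norm(keypoints["gripper_tip"] - keypoints["mug_handle"])

def all_satisfied(constraints, keypoints, tol=1e-2):
    return all(c(keypoints) < tol for c in constraints)

keypoints = {
    "gripper_tip": np.array([0.40, 0.02, 0.31]),
    "mug_handle":  np.array([0.40, 0.02, 0.30]),
}
print(all_satisfied([grasp_constraint], keypoints))  # True within tolerance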

Spectrum-guided multi-granularity referring video object segmentation

B Miao, M Bennamoun, Y Gao… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Current referring video object segmentation (R-VOS) techniques extract conditional kernels
from encoded (low-resolution) vision-language features to segment the decoded high …
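The conditional-kernel mechanism mentioned in the snippet can be sketched as a dynamic 1x1 convolution: a kernel predicted from a pooled vision-language feature is applied to high-resolution decoder features to produce mask logits. The layer sizes below are illustrative assumptions, not the paper's architecture.

import torch
import torch.nn as nn

class ConditionalKernelHead(nn.Module):
    def __init__(self, vl_dim=256, feat_dim=64):
        super().__init__()
        self.feat_dim = feat_dim
        # Predict a (feat_dim + 1)-element 1x1 kernel (weights + bias).
        self.kernel_pred = nn.Linear(vl_dim, feat_dim + 1)

    def forward(self, vl_feature, decoder_features):
        """vl_feature:       (B, vl_dim) pooled vision-language feature
           decoder_features: (B, feat_dim, H, W) high-resolution features"""
        params = self.kernel_pred(vl_feature)                    # (B, feat_dim + 1)
        weight = params[:, : self.feat_dim]                      # (B, feat_dim)
        bias = params[:, self.feat_dim]                          # (B,)

        # Per-sample 1x1 convolution implemented as a channel dot product.
        masks = torch.einsum("bc,bchw->bhw", weight, decoder_features)
        return masks + bias[:, None, None]                       # (B, H, W) logits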

DINO-Tracker: Taming DINO for self-supervised point tracking in a single video

N Tumanyan, A Singer, S Bagon, T Dekel - European Conference on …, 2024 - Springer
We present DINO-Tracker, a new framework for long-term dense tracking in video. The pillar
of our approach is combining test-time training on a single video with the powerful localized …
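The basic ingredient the snippet points to, localized ViT features matched across frames, can be sketched as dense descriptor matching for a query point; the method itself additionally trains a refinement network on the single video at test time. The feature extractor and shapes below are assumptions, with any DINO-style dense feature map as input.

import torch
import torch.nn.functional as F

def track_point_by_feature_matching(feat_src, feat_tgt, query_xy):
    """feat_src, feat_tgt: (C, H, W) dense features for two frames
       query_xy:           (x, y) point in feature-map coordinates (frame 1)"""
    C, H, W = feat_src.shape
    x, y = query_xy
    query = F.normalize(feat_src[:, y, x], dim=0)                # (C,)

    # Cosine similarity of the query descriptor against every target location.
    tgt = F.normalize(feat_tgt.reshape(C, -1), dim=0)            # (C, H*W)
    sim = query @ tgt                                            # (H*W,)

    best = sim.argmax().item()
    return best % W, best // W                                   # (x, y) in frame 2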