ZegCLIP: Towards adapting CLIP for zero-shot semantic segmentation
Recently, CLIP has been applied to pixel-level zero-shot learning tasks via a two-stage
scheme. The general idea is to first generate class-agnostic region proposals and then feed …
What does CLIP know about a red circle? Visual prompt engineering for VLMs
Large-scale Vision-Language Models, such as CLIP, learn powerful image-text
representations that have found numerous applications, from zero-shot classification to text …
Neural Feature Fusion Fields: 3D distillation of self-supervised 2D image representations
We present Neural Feature Fusion Fields (N3F), a method that improves dense 2D image
feature extractors when the latter are applied to the analysis of multiple images …
Bridging the gap to real-world object-centric learning
Humans naturally decompose their environment into entities at the appropriate level of
abstraction to act in the world. Allowing machine learning algorithms to derive this …
Open vocabulary semantic segmentation with patch aligned contrastive learning
We introduce Patch Aligned Contrastive Learning (PACL), a modified compatibility
function for CLIP's contrastive loss, intending to train an alignment between the patch tokens …
MasQCLIP for open-vocabulary universal image segmentation
We present a new method for open-vocabulary universal image segmentation, which is
capable of performing instance, semantic, and panoptic segmentation under a unified …
Probing the 3D awareness of visual foundation models
Recent advances in large-scale pretraining have yielded visual foundation models with
strong capabilities. Not only can recent models generalize to arbitrary images for their …
ReKep: Spatio-temporal reasoning of relational keypoint constraints for robotic manipulation
Representing robotic manipulation tasks as constraints that associate the robot and the
environment is a promising way to encode desired robot behaviors. However, it remains …
Spectrum-guided multi-granularity referring video object segmentation
Current referring video object segmentation (R-VOS) techniques extract conditional kernels
from encoded (low-resolution) vision-language features to segment the decoded high …
DINO-Tracker: Taming DINO for self-supervised point tracking in a single video
We present DINO-Tracker, a new framework for long-term dense tracking in video. The pillar
of our approach is combining test-time training on a single video, with the powerful localized …