ZegCLIP: Towards adapting CLIP for zero-shot semantic segmentation

Z Zhou, Y Lei, B Zhang, L Liu… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Recently, CLIP has been applied to pixel-level zero-shot learning tasks via a two-stage
scheme. The general idea is to first generate class-agnostic region proposals and then feed …
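The two-stage scheme this snippet refers to can be illustrated in a few lines. Below is a minimal sketch, assuming a class-agnostic proposal generator and CLIP-style image/text encoders are available as callables; it shows the baseline pipeline the abstract describes, not ZegCLIP's own one-stage design.

import torch

def two_stage_zero_shot_segmentation(image, proposals, clip_image_encoder,
                                      clip_text_encoder, class_prompts):
    """Classify class-agnostic region proposals with CLIP.

    image:          (3, H, W) tensor
    proposals:      list of (x0, y0, x1, y1) boxes from a class-agnostic model
    class_prompts:  list of text prompts, one per candidate class
    """
    # Embed the candidate class names once.
    text_feats = clip_text_encoder(class_prompts)                # (C, D)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)

    labels = []
    for (x0, y0, x1, y1) in proposals:
        crop = image[:, y0:y1, x0:x1].unsqueeze(0)               # (1, 3, h, w)
        img_feat = clip_image_encoder(crop)                      # (1, D)
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        sims = img_feat @ text_feats.T                           # (1, C)
        labels.append(sims.argmax(dim=-1).item())
    return labels  # one class index per proposal; paste back into a mask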

What does CLIP know about a red circle? Visual prompt engineering for VLMs

A Shtedritski, C Rupprecht… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Abstract Large-scale Vision-Language Models, such as CLIP, learn powerful image-text
representations that have found numerous applications, from zero-shot classification to text …
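As a concrete illustration of the visual prompt engineering the title points to, the sketch below draws a red circle around a region and scores the marked image against a text query using the open-source clip package; the box coordinates and prompt wording are placeholder assumptions.

import torch
import clip
from PIL import Image, ImageDraw

def red_circle_score(image_path, box, text, device="cpu"):
    model, preprocess = clip.load("ViT-B/32", device=device)
    img = Image.open(image_path).convert("RGB")

    # Draw the visual prompt: a red ellipse around the region of interest.
    draw = ImageDraw.Draw(img)
    draw.ellipse(box, outline=(255, 0, 0), width=4)

    image_input = preprocess(img).unsqueeze(0).to(device)
    text_input = clip.tokenize([text]).to(device)
    with torch.no_grad():
        logits_per_image, _ = model(image_input, text_input)
    return logits_per_image.item()

# Example: score = red_circle_score("scene.jpg", (120, 80, 240, 200), "a dog")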

Neural feature fusion fields: 3d distillation of self-supervised 2d image representations

V Tschernezki, I Laina, D Larlus… - … Conference on 3D …, 2022 - ieeexplore.ieee.org
We present Neural Feature Fusion Fields (N3F), a method that improves dense 2D image
feature extractors when the latter are applied to the analysis of multiple images …
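A rough sketch of the distillation idea the snippet describes, under simplifying assumptions: a coordinate MLP stands in for the feature field and a single 3D point per pixel replaces volume rendering. This is not the N3F implementation, only the fitting objective in miniature.

import torch
import torch.nn as nn

class FeatureField(nn.Module):
    def __init__(self, feat_dim=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, feat_dim),
        )

    def forward(self, xyz):          # xyz: (N, 3) 3D points
        return self.mlp(xyz)         # (N, feat_dim)

def distillation_loss(field, surface_points, target_2d_features):
    """surface_points:     (N, 3) points hit by rays through sampled pixels
       target_2d_features: (N, feat_dim) 2D extractor features at those pixels"""
    pred = field(surface_points)
    return ((pred - target_2d_features) ** 2).mean()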

Bridging the gap to real-world object-centric learning

M Seitzer, M Horn, A Zadaianchuk, D Zietlow… - arXiv preprint arXiv …, 2022 - arxiv.org
Humans naturally decompose their environment into entities at the appropriate level of
abstraction to act in the world. Allowing machine learning algorithms to derive this …

Open vocabulary semantic segmentation with patch aligned contrastive learning

J Mukhoti, TY Lin, O Poursaeed… - Proceedings of the …, 2023 - openaccess.thecvf.com
Abstract We introduce Patch Aligned Contrastive Learning (PACL), a modified compatibility
function for CLIP's contrastive loss, intending to train an alignment between the patch tokens …
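The snippet names a modified compatibility function between patch tokens and text. A hedged sketch of such a patch-aligned compatibility follows: the image-text score is computed from patch tokens weighted by their similarity to the text embedding rather than from the global image token alone. Shapes and the temperature value are illustrative assumptions, not PACL's exact formulation.

import torch
import torch.nn.functional as F

def patch_aligned_compatibility(patch_tokens, text_emb, temperature=0.07):
    """patch_tokens: (B, N, D) projected patch embeddings
       text_emb:     (B, D)    projected text embeddings"""
    patch_tokens = F.normalize(patch_tokens, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Patch-to-text similarity defines an attention over patches.
    sim = torch.einsum("bnd,bd->bn", patch_tokens, text_emb)     # (B, N)
    weights = sim.softmax(dim=-1)

    # Text-conditioned pooled image embedding.
    pooled = torch.einsum("bn,bnd->bd", weights, patch_tokens)   # (B, D)
    pooled = F.normalize(pooled, dim=-1)

    # Compatibility score used inside the contrastive loss.
    return (pooled * text_emb).sum(dim=-1) / temperature         # (B,)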

MasQCLIP for open-vocabulary universal image segmentation

X Xu, T **ong, Z Ding, Z Tu - Proceedings of the IEEE/CVF …, 2023 - openaccess.thecvf.com
We present a new method for open-vocabulary universal image segmentation, which is
capable of performing instance, semantic, and panoptic segmentation under a unified …

Probing the 3d awareness of visual foundation models

M El Banani, A Raj, KK Maninis, A Kar… - Proceedings of the …, 2024 - openaccess.thecvf.com
Recent advances in large-scale pretraining have yielded visual foundation models with
strong capabilities. Not only can recent models generalize to arbitrary images for their …

ReKep: Spatio-temporal reasoning of relational keypoint constraints for robotic manipulation

W Huang, C Wang, Y Li, R Zhang, L Fei-Fei - arXiv preprint arXiv …, 2024 - arxiv.org
Representing robotic manipulation tasks as constraints that associate the robot and the
environment is a promising way to encode desired robot behaviors. However, it remains …
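To make the notion of a constraint over keypoints concrete, here is a toy sketch: each constraint maps tracked 3D keypoints to a scalar cost, and a state is feasible when every cost falls below a tolerance. The keypoint names and the grasp constraint are invented examples, not ReKep's own constraints.

import numpy as np

def grasp_constraint(keypoints):
    """keypoints: dict of name -> (3,) array of 3D positions."""
    # Cost is the distance between the gripper tip and the object handle.
    return np.linalg.norm(keypoints["gripper_tip"] - keypoints["mug_handle"])

def all_satisfied(constraints, keypoints, tol=1e-2):
    return all(c(keypoints) < tol for c in constraints)

keypoints = {
    "gripper_tip": np.array([0.40, 0.02, 0.31]),
    "mug_handle":  np.array([0.40, 0.02, 0.30]),
}
print(all_satisfied([grasp_constraint], keypoints))  # True within tolerance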

Spectrum-guided multi-granularity referring video object segmentation

B Miao, M Bennamoun, Y Gao… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Current referring video object segmentation (R-VOS) techniques extract conditional kernels
from encoded (low-resolution) vision-language features to segment the decoded high …
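The conditional-kernel mechanism mentioned in the snippet can be sketched as a dynamic 1x1 convolution: a kernel predicted from a pooled vision-language feature is applied to high-resolution decoder features to produce mask logits. The layer sizes below are illustrative assumptions, not the paper's architecture.

import torch
import torch.nn as nn

class ConditionalKernelHead(nn.Module):
    def __init__(self, vl_dim=256, feat_dim=64):
        super().__init__()
        self.feat_dim = feat_dim
        # Predict a (feat_dim + 1)-element 1x1 kernel (weights + bias).
        self.kernel_pred = nn.Linear(vl_dim, feat_dim + 1)

    def forward(self, vl_feature, decoder_features):
        """vl_feature:       (B, vl_dim) pooled vision-language feature
           decoder_features: (B, feat_dim, H, W) high-resolution features"""
        params = self.kernel_pred(vl_feature)                    # (B, feat_dim + 1)
        weight = params[:, : self.feat_dim]                      # (B, feat_dim)
        bias = params[:, self.feat_dim]                          # (B,)

        # Per-sample 1x1 convolution implemented as a channel dot product.
        masks = torch.einsum("bc,bchw->bhw", weight, decoder_features)
        return masks + bias[:, None, None]                       # (B, H, W) logits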

DINO-Tracker: Taming DINO for self-supervised point tracking in a single video

N Tumanyan, A Singer, S Bagon, T Dekel - European Conference on …, 2024 - Springer
We present DINO-Tracker, a new framework for long-term dense tracking in video. The pillar
of our approach is combining test-time training on a single video with the powerful localized …
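The basic ingredient the snippet points to, localized ViT features matched across frames, can be sketched as dense descriptor matching for a query point; the method itself additionally trains a refinement network on the single video at test time. The feature extractor and shapes below are assumptions, with any DINO-style dense feature map as input.

import torch
import torch.nn.functional as F

def track_point_by_feature_matching(feat_src, feat_tgt, query_xy):
    """feat_src, feat_tgt: (C, H, W) dense features for two frames
       query_xy:           (x, y) point in feature-map coordinates (frame 1)"""
    C, H, W = feat_src.shape
    x, y = query_xy
    query = F.normalize(feat_src[:, y, x], dim=0)                # (C,)

    # Cosine similarity of the query descriptor against every target location.
    tgt = F.normalize(feat_tgt.reshape(C, -1), dim=0)            # (C, H*W)
    sim = query @ tgt                                            # (H*W,)

    best = sim.argmax().item()
    return best % W, best // W                                   # (x, y) in frame 2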