Grounding everything: Emerging localization properties in vision-language transformers

W Bousselham, F Petersen… - Proceedings of the …, 2024 - openaccess.thecvf.com
Vision-language foundation models have shown remarkable performance in various zero-shot
settings such as image retrieval, classification, or captioning. But so far those models …

CLIP as RNN: Segment countless visual concepts without training endeavor

S Sun, R Li, P Torr, X Gu, S Li - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Existing open-vocabulary image segmentation methods require a fine-tuning step on mask
labels and/or image-text datasets. Mask labels are labor-intensive, which limits the number of …

Diffusion feedback helps CLIP see better

W Wang, Q Sun, F Zhang, Y Tang, J Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
Contrastive Language-Image Pre-training (CLIP), which excels at abstracting open-world
representations across domains and modalities, has become a foundation for a variety of …

Decoupling static and hierarchical motion perception for referring video segmentation

S He, H Ding - Proceedings of the IEEE/CVF Conference on …, 2024 - openaccess.thecvf.com
Referring video segmentation relies on natural language expressions to identify and
segment objects, often emphasizing motion clues. Previous works treat a sentence as a …

Zero-shot referring expression comprehension via structural similarity between images and captions

Z Han, F Zhu, Q Lao, H Jiang - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Zero-shot referring expression comprehension aims at localizing bounding boxes in an
image corresponding to provided textual prompts, which requires: (i) a fine-grained …

PrimitiveNet: decomposing the global constraints for referring segmentation

C Liu, X Jiang, H Ding - Visual Intelligence, 2024 - Springer
In referring segmentation, modeling the complicated constraints in the multimodal
information is one of the most challenging problems. As the information in a given language …

RESMatch: Referring expression segmentation in a semi-supervised manner

Y Zang, R Cao, C Fu, D Zhu, M Zhang, W Hu, L Zhu… - Information …, 2025 - Elsevier
Referring expression segmentation (RES), a task that involves localizing specific instance-level
objects on the basis of free-form linguistic descriptions, has emerged as a crucial …

Text promptable surgical instrument segmentation with vision-language models

Z Zhou, O Alabi, M Wei… - Advances in Neural …, 2023 - proceedings.neurips.cc
In this paper, we propose a novel text promptable surgical instrument segmentation
approach to overcome challenges associated with diversity and differentiation of surgical …

Ref-Diff: Zero-shot referring image segmentation with generative models

M Ni, Y Zhang, K Feng, X Li, Y Guo, W Zuo - arXiv preprint arXiv …, 2023 - arxiv.org
Zero-shot referring image segmentation is a challenging task because it aims to find an
instance segmentation mask based on the given referring descriptions, without training on …

Unveiling Parts Beyond Objects: Towards Finer-Granularity Referring Expression Segmentation

W Wang, T Yue, Y Zhang, L Guo… - Proceedings of the …, 2024 - openaccess.thecvf.com
Referring expression segmentation (RES) aims at segmenting the foreground masks of the
entities that match the descriptive natural language expression. Previous datasets and …