Grounding everything: Emerging localization properties in vision-language transformers
Vision-language foundation models have shown remarkable performance in various zero-shot settings such as image retrieval, classification, or captioning. But so far those models …
CLIP as RNN: Segment countless visual concepts without training endeavor
Existing open-vocabulary image segmentation methods require a fine-tuning step on mask labels and/or image-text datasets. Mask labels are labor-intensive, which limits the number of …
Diffusion feedback helps CLIP see better
Contrastive Language-Image Pre-training (CLIP), which excels at abstracting open-world representations across domains and modalities, has become a foundation for a variety of …
Decoupling static and hierarchical motion perception for referring video segmentation
Referring video segmentation relies on natural language expressions to identify and segment objects, often emphasizing motion clues. Previous works treat a sentence as a …
Zero-shot referring expression comprehension via structural similarity between images and captions
Zero-shot referring expression comprehension aims at localizing bounding boxes in an image corresponding to provided textual prompts, which requires: (i) a fine-grained …
PrimitiveNet: Decomposing the global constraints for referring segmentation
In referring segmentation, modeling the complicated constraints in the multimodal information is one of the most challenging problems. As the information in a given language …
RESMatch: Referring expression segmentation in a semi-supervised manner
Referring expression segmentation (RES), a task that involves localizing specific instance-level objects on the basis of free-form linguistic descriptions, has emerged as a crucial …
Text promptable surgical instrument segmentation with vision-language models
In this paper, we propose a novel text promptable surgical instrument segmentation approach to overcome challenges associated with diversity and differentiation of surgical …
Ref-diff: Zero-shot referring image segmentation with generative models
Zero-shot referring image segmentation is a challenging task because it aims to find an instance segmentation mask based on the given referring descriptions, without training on …
Unveiling Parts Beyond Objects: Towards Finer-Granularity Referring Expression Segmentation
Referring expression segmentation (RES) aims at segmenting the foreground masks of the entities that match the descriptive natural language expression. Previous datasets and …