OMG-Seg: Is one model good enough for all segmentation?
In this work, we address various segmentation tasks, each traditionally tackled by distinct or
partially unified models. We propose OMG-Seg, One Model that is Good enough to efficiently …
Sclip: Rethinking self-attention for dense vision-language inference
Recent advances in contrastive language-image pretraining (CLIP) have demonstrated
strong capabilities in zero-shot classification by aligning visual and textual features at an …
Pink: Unveiling the power of referential comprehension for multi-modal llms
Multi-modal Large Language Models (MLLMs) have shown remarkable capabilities
in various multi-modal tasks. Nevertheless, their performance in fine-grained image …
A survey on open-vocabulary detection and segmentation: Past, present, and future
As the most fundamental scene understanding tasks, object detection and segmentation
have made tremendous progress in the deep learning era. Due to the expensive manual …
Open-vocabulary SAM: Segment and recognize twenty-thousand classes interactively
The CLIP and Segment Anything Model (SAM) are remarkable vision foundation
models (VFMs). SAM excels in segmentation tasks across diverse domains, whereas CLIP is …
Proxyclip: Proxy attention improves clip for open-vocabulary segmentation
Open-vocabulary semantic segmentation requires models to effectively integrate visual
representations with open-vocabulary semantic labels. While Contrastive Language-Image …
DAC-DETR: Divide the attention layers and conquer
This paper reveals a characteristic of the DEtection TRansformer (DETR) that negatively impacts
its training efficacy, i.e., the cross-attention and self-attention layers in the DETR decoder have …
Clearclip: Decomposing clip representations for dense vision-language inference
Despite the success of large-scale pretrained Vision-Language Models (VLMs), especially
CLIP, in various open-vocabulary tasks, their application to semantic segmentation remains …
ViTamin: Designing Scalable Vision Models in the Vision-Language Era
Recent breakthroughs in vision-language models (VLMs) open a new chapter in the vision
community. The VLMs provide stronger and more generalizable feature embeddings …
Clim: Contrastive language-image mosaic for region representation
Detecting objects accurately from a large or open vocabulary necessitates vision-
language alignment on region representations. However, learning such a region-text …