OMG-Seg: Is one model good enough for all segmentation?

X Li, H Yuan, W Li, H Ding, S Wu… - Proceedings of the …, 2024 - openaccess.thecvf.com
In this work, we address various segmentation tasks, each traditionally tackled by distinct or
partially unified models. We propose OMG-Seg, One Model that is Good enough to efficiently …

SCLIP: Rethinking self-attention for dense vision-language inference

F Wang, J Mei, A Yuille - European Conference on Computer Vision, 2024 - Springer
Recent advances in contrastive language-image pretraining (CLIP) have demonstrated
strong capabilities in zero-shot classification by aligning visual and textual features at an …
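
The alignment mechanism this snippet refers to, image and text features embedded in a shared space and compared by similarity for zero-shot classification, can be illustrated with a minimal sketch of plain CLIP inference (not SCLIP's modified self-attention). The Hugging Face transformers API, checkpoint name, and placeholder image below are assumptions for illustration only.

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    # Load a pretrained CLIP checkpoint (weights download on first use).
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    # Placeholder image; in practice this would be a real photo.
    image = Image.new("RGB", (224, 224), color="gray")
    prompts = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

    # Encode image and candidate text prompts into the shared embedding space.
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)

    # Scaled image-text similarities, turned into zero-shot class probabilities.
    probs = out.logits_per_image.softmax(dim=-1)
    print(dict(zip(prompts, probs[0].tolist())))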

Pink: Unveiling the power of referential comprehension for multi-modal LLMs

S Xuan, Q Guo, M Yang… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Multi-modal Large Language Models (MLLMs) have shown remarkable capabilities
in various multi-modal tasks. Nevertheless, their performance in fine-grained image …

A survey on open-vocabulary detection and segmentation: Past, present, and future

C Zhu, L Chen - IEEE Transactions on Pattern Analysis and …, 2024 - ieeexplore.ieee.org
As the most fundamental scene understanding tasks, object detection and segmentation
have made tremendous progress in the deep learning era. Due to the expensive manual …

Open-vocabulary SAM: Segment and recognize twenty-thousand classes interactively

H Yuan, X Li, C Zhou, Y Li, K Chen, CC Loy - European Conference on …, 2024 - Springer
The CLIP and Segment Anything Model (SAM) are remarkable vision foundation
models (VFMs). SAM excels in segmentation tasks across diverse domains, whereas CLIP is …

ProxyCLIP: Proxy attention improves CLIP for open-vocabulary segmentation

M Lan, C Chen, Y Ke, X Wang, L Feng… - European Conference on …, 2024 - Springer
Open-vocabulary semantic segmentation requires models to effectively integrate visual
representations with open-vocabulary semantic labels. While Contrastive Language-Image …

DAC-DETR: Divide the attention layers and conquer

Z Hu, Y Sun, J Wang, Y Yang - Advances in Neural …, 2024 - proceedings.neurips.cc
This paper reveals a characteristic of DEtection Transformer (DETR) that negatively impacts
its training efficacy, i.e., the cross-attention and self-attention layers in the DETR decoder have …

ClearCLIP: Decomposing CLIP representations for dense vision-language inference

M Lan, C Chen, Y Ke, X Wang, L Feng… - European Conference on …, 2024 - Springer
Despite the success of large-scale pretrained Vision-Language Models (VLMs), especially
CLIP, in various open-vocabulary tasks, their application to semantic segmentation remains …

ViTamin: Designing Scalable Vision Models in the Vision-Language Era

J Chen, Q Yu, X Shen, A Yuille… - Proceedings of the …, 2024 - openaccess.thecvf.com
Recent breakthroughs in vision-language models (VLMs) start a new page in the vision
community. The VLMs provide stronger and more generalizable feature embeddings …

CLIM: Contrastive language-image mosaic for region representation

S Wu, W Zhang, L Xu, S Jin, W Liu… - Proceedings of the AAAI …, 2024 - ojs.aaai.org
Detecting objects accurately from a large or open vocabulary necessitates the vision-
language alignment on region representations. However, learning such a region-text …