A survey on open-vocabulary detection and segmentation: Past, present, and future

C Zhu, L Chen - IEEE Transactions on Pattern Analysis and …, 2024 - ieeexplore.ieee.org
As the most fundamental scene understanding tasks, object detection and segmentation
have made tremendous progress in the deep learning era. Due to the expensive manual …

Towards open vocabulary learning: A survey

J Wu, X Li, S Xu, H Yuan, H Ding… - … on Pattern Analysis …, 2024 - ieeexplore.ieee.org
In the field of visual scene understanding, deep neural networks have made impressive
advancements in various core tasks like segmentation, tracking, and detection. However …

Omg-seg: Is one model good enough for all segmentation?

X Li, H Yuan, W Li, H Ding, S Wu… - Proceedings of the …, 2024 - openaccess.thecvf.com
In this work, we address various segmentation tasks, each traditionally tackled by distinct or
partially unified models. We propose OMG-Seg, One Model that is Good enough to efficiently …

Sclip: Rethinking self-attention for dense vision-language inference

F Wang, J Mei, A Yuille - European Conference on Computer Vision, 2024 - Springer
Recent advances in contrastive language-image pretraining (CLIP) have demonstrated
strong capabilities in zero-shot classification by aligning visual and textual features at an …

Vitamin: Designing scalable vision models in the vision-language era

J Chen, Q Yu, X Shen, A Yuille… - Proceedings of the …, 2024 - openaccess.thecvf.com
Recent breakthroughs in vision-language models (VLMs) open a new chapter for the vision
community. VLMs provide stronger and more generalizable feature embeddings …

Pink: Unveiling the power of referential comprehension for multi-modal llms

S Xuan, Q Guo, M Yang… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Multi-modal Large Language Models (MLLMs) have shown remarkable capabilities
in various multi-modal tasks. Nevertheless, their performance in fine-grained image …

Proxyclip: Proxy attention improves clip for open-vocabulary segmentation

M Lan, C Chen, Y Ke, X Wang, L Feng… - European Conference on …, 2024 - Springer
Open-vocabulary semantic segmentation requires models to effectively integrate visual
representations with open-vocabulary semantic labels. While Contrastive Language-Image …

Clearclip: Decomposing clip representations for dense vision-language inference

M Lan, C Chen, Y Ke, X Wang, L Feng… - European Conference on …, 2024 - Springer
Despite the success of large-scale pretrained Vision-Language Models (VLMs) especially
CLIP in various open-vocabulary tasks, their application to semantic segmentation remains …

DAC-DETR: Divide the attention layers and conquer

Z Hu, Y Sun, J Wang, Y Yang - Advances in Neural …, 2023 - proceedings.neurips.cc
This paper reveals a characteristic of the DEtection TRansformer (DETR) that negatively impacts
its training efficacy, i.e., the cross-attention and self-attention layers in the DETR decoder have …

Exploring regional clues in CLIP for zero-shot semantic segmentation

Y Zhang, MH Guo, M Wang… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
CLIP has demonstrated marked progress in visual recognition due to its powerful pretraining
on large-scale image-text pairs. However, a critical challenge remains: how to …