OpenScene: 3D scene understanding with open vocabularies

S Peng, K Genova, C Jiang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Traditional 3D scene understanding approaches rely on labeled 3D datasets to train a
model for a single task with supervision. We propose OpenScene, an alternative approach …

Recognize anything: A strong image tagging model

Y Zhang, X Huang, J Ma, Z Li, Z Luo… - Proceedings of the …, 2024 - openaccess.thecvf.com
We present the Recognize Anything Model (RAM): a strong foundation model for
image tagging. RAM makes a substantial step for foundation models in computer vision …

OpenMask3D: Open-vocabulary 3D instance segmentation

A Takmaz, E Fedele, RW Sumner, M Pollefeys… - arXiv preprint arXiv …, 2023 - arxiv.org
We introduce the task of open-vocabulary 3D instance segmentation. Traditional
approaches for 3D instance segmentation largely rely on existing 3D annotated datasets …

CLIP surgery for better explainability with enhancement in open-vocabulary tasks

Y Li, H Wang, Y Duan, X Li - arXiv preprint arXiv:2304.05653, 2023 - arxiv.org
Contrastive Language-Image Pre-training (CLIP) is a powerful multimodal large vision
model that has demonstrated significant benefits for downstream tasks, including many zero …

Knowledge-enhanced dual-stream zero-shot composed image retrieval

Y Suo, F Ma, L Zhu, Y Yang - Proceedings of the IEEE/CVF …, 2024 - openaccess.thecvf.com
We study the zero-shot Composed Image Retrieval (ZS-CIR) task, which is to retrieve the
target image given a reference image and a description without training on the triplet …

Simple image-level classification improves open-vocabulary object detection

R Fang, G Pang, X Bai - Proceedings of the AAAI Conference on …, 2024 - ojs.aaai.org
Open-Vocabulary Object Detection (OVOD) aims to detect novel objects beyond a given set
of base categories on which the detection model is trained. Recent OVOD methods focus on …

Ceprompt: Cross-modal emotion-aware prompting for facial expression recognition

H Zhou, S Huang, F Zhang, C Xu - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
Facial expression recognition (FER) remains a challenging task due to the ambiguity and
subtlety of expressions. To address this challenge, current FER methods predominantly …

Vision-Language Pseudo-Labels for Single-Positive Multi-Label Learning

X **ng, Z **ong, A Stylianou, S Sastry… - Proceedings of the …, 2024 - openaccess.thecvf.com
We study a limited-label problem and present a novel approach to Single-Positive
Multi-label Learning. In the multi-label learning setting, a model learns to predict multiple labels or …

A closer look at the explainability of contrastive language-image pre-training

Y Li, H Wang, Y Duan, J Zhang, X Li - Pattern Recognition, 2025 - Elsevier
Contrastive language-image pre-training (CLIP) is a powerful vision-language model that
has shown great benefits for various tasks. However, we have identified some issues with its …

TagCLIP: A Local-to-Global Framework to Enhance Open-Vocabulary Multi-Label Classification of CLIP without Training

Y Lin, M Chen, K Zhang, H Li, M Li, Z Yang… - Proceedings of the …, 2024 - ojs.aaai.org
Contrastive Language-Image Pre-training (CLIP) has demonstrated impressive capabilities
in open-vocabulary classification. The class token in the image encoder is trained to capture …