OpenScene: 3D scene understanding with open vocabularies
Traditional 3D scene understanding approaches rely on labeled 3D datasets to train a
model for a single task with supervision. We propose OpenScene, an alternative approach …
Recognize anything: A strong image tagging model
We present the Recognize Anything Model (RAM): a strong foundation model for
image tagging. RAM makes a substantial step for foundation models in computer vision …
OpenMask3D: Open-vocabulary 3D instance segmentation
We introduce the task of open-vocabulary 3D instance segmentation. Traditional
approaches for 3D instance segmentation largely rely on existing 3D annotated datasets …
CLIP Surgery for better explainability with enhancement in open-vocabulary tasks
Contrastive Language-Image Pre-training (CLIP) is a powerful multimodal large vision
model that has demonstrated significant benefits for downstream tasks, including many zero …
Knowledge-enhanced dual-stream zero-shot composed image retrieval
We study the zero-shot Composed Image Retrieval (ZS-CIR) task which is to retrieve the
target image given a reference image and a description without training on the triplet …
Simple image-level classification improves open-vocabulary object detection
Open-Vocabulary Object Detection (OVOD) aims to detect novel objects beyond a given set
of base categories on which the detection model is trained. Recent OVOD methods focus on …
CEPrompt: Cross-modal emotion-aware prompting for facial expression recognition
Facial expression recognition (FER) remains a challenging task due to the ambiguity and
subtlety of expressions. To address this challenge, current FER methods predominantly …
Vision-Language Pseudo-Labels for Single-Positive Multi-Label Learning
We study a limited label problem and present a novel approach to Single-Positive Multi-
label Learning. In the multi-label learning setting a model learns to predict multiple labels or …
A closer look at the explainability of contrastive language-image pre-training
Contrastive language-image pre-training (CLIP) is a powerful vision-language model that
has shown great benefits for various tasks. However, we have identified some issues with its …
TagCLIP: A Local-to-Global Framework to Enhance Open-Vocabulary Multi-Label Classification of CLIP without Training
Contrastive Language-Image Pre-training (CLIP) has demonstrated impressive capabilities
in open-vocabulary classification. The class token in the image encoder is trained to capture …