Vision-language pre-training: Basics, recent advances, and future trends
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …
Semantic image segmentation: Two decades of research
Semantic image segmentation (SiS) plays a fundamental role in a broad variety of computer
vision applications, providing key information for the global understanding of an image. This …
Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection
In this paper, we develop an open-set object detector, called Grounding DINO, by marrying
Transformer-based detector DINO with grounded pre-training, which can detect arbitrary …
Unleashing text-to-image diffusion models for visual perception
Diffusion models (DMs) have become the new trend of generative models and have
demonstrated a powerful ability of conditional synthesis. Among those, text-to-image …
Scalable 3D captioning with pretrained models
We introduce Cap3D, an automatic approach for generating descriptive text for 3D objects.
This approach utilizes pretrained models from image captioning, image-text alignment, and …
GRES: Generalized referring expression segmentation
Referring Expression Segmentation (RES) aims to generate a segmentation mask
for the object described by a given language expression. Existing classic RES datasets and …
Tip-Adapter: Training-free adaption of CLIP for few-shot classification
Contrastive Vision-Language Pre-training, known as CLIP, has provided a new
paradigm for learning visual representations using large-scale image-text pairs. It shows …
Scaling open-vocabulary image segmentation with image-level labels
We design an open-vocabulary image segmentation model to organize an image into
meaningful regions indicated by arbitrary texts. Recent works (CLIP and ALIGN), despite …
PointCLIP: Point cloud understanding by CLIP
Recently, zero-shot and few-shot learning via Contrastive Vision-Language Pre-training
(CLIP) have shown inspirational performance on 2D visual recognition, which learns to …
What does CLIP know about a red circle? Visual prompt engineering for VLMs
A Shtedritski, C Rupprecht… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Large-scale Vision-Language Models, such as CLIP, learn powerful image-text
representations that have found numerous applications, from zero-shot classification to text …