CLIP in medical imaging: A comprehensive survey
Contrastive Language-Image Pre-training (CLIP), a simple yet effective pre-training
paradigm, successfully introduces text supervision to vision models. It has shown promising …
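The contrastive objective this survey builds on can be summarised in a few lines. Below is a minimal sketch of the CLIP-style symmetric loss, assuming pre-computed image and text features in matching batch order; the fixed temperature and feature shapes are illustrative assumptions, not taken from any of the listed papers.

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    # L2-normalise both modalities so similarity is a cosine score.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise similarities; CLIP learns the temperature, fixed here for brevity.
    logits = image_features @ text_features.t() / temperature

    # The i-th image matches the i-th text, so targets are the diagonal.
    targets = torch.arange(len(logits), device=logits.device)

    # Symmetric cross-entropy over image-to-text and text-to-image directions.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2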
Video-ChatGPT: Towards detailed video understanding via large vision and language models
Conversation agents fueled by Large Language Models (LLMs) are providing a new way to
interact with visual data. While there have been initial attempts for image-based …
Vision-language models for vision tasks: A survey
Most visual recognition studies rely heavily on crowd-labelled data for deep neural network
(DNN) training, and they usually train a separate DNN for each visual recognition task …
MaPLe: Multi-modal prompt learning
Pre-trained vision-language (VL) models such as CLIP have shown excellent generalization
ability to downstream tasks. However, they are sensitive to the choice of input text prompts …
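For readers unfamiliar with prompt learning, the sketch below shows the basic recipe that MaPLe generalises: hand-written prompt templates are replaced by a few learnable context vectors optimised while CLIP stays frozen. The encoder interface, shapes, and the text-only (CoOp-style) prompting are assumptions for illustration; MaPLe itself additionally learns prompts in the vision branch.

import torch
import torch.nn as nn

class PromptedClassifier(nn.Module):
    def __init__(self, clip_text_encoder, class_token_embeddings,
                 n_ctx=4, dim=512):
        super().__init__()
        self.text_encoder = clip_text_encoder  # frozen CLIP text tower (assumed
        for p in self.text_encoder.parameters():  # to accept embedded sequences)
            p.requires_grad_(False)
        # Learnable context vectors shared across classes ("a photo of a ...").
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)
        # Pre-computed token embeddings of the class names, shape (C, L, dim).
        self.register_buffer("cls_emb", class_token_embeddings)

    def class_features(self):
        C = self.cls_emb.size(0)
        ctx = self.ctx.unsqueeze(0).expand(C, -1, -1)  # (C, n_ctx, dim)
        # Prepend the learned context to each class-name embedding sequence.
        prompts = torch.cat([ctx, self.cls_emb], dim=1)
        return self.text_encoder(prompts)  # (C, dim) class text features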
Self-regulating prompts: Foundational model adaptation without forgetting
Prompt learning has emerged as an efficient alternative for fine-tuning foundational models,
such as CLIP, for various downstream tasks. Conventionally trained using the task-specific …
Aligning bag of regions for open-vocabulary object detection
Pre-trained vision-language models (VLMs) learn to align vision and language
representations on large-scale datasets, where each image-text pair usually contains a bag …
PLA: Language-driven open-vocabulary 3D scene understanding
Open-vocabulary scene understanding aims to localize and recognize unseen categories
beyond the annotated label space. The recent breakthrough of 2D open-vocabulary …
Fine-tuned CLIP models are efficient video learners
Large-scale multi-modal training with image-text pairs imparts strong generalization to the CLIP
model. Since training on a similar scale for videos is infeasible, recent approaches focus on …
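One widely used recipe for the adaptation this abstract alludes to is to run the image-level CLIP encoder on each frame independently and pool over time. The sketch below illustrates that idea under assumed tensor shapes and an assumed encoder interface; it is not this paper's exact design.

import torch

def video_features(clip_image_encoder, frames):
    """frames: (batch, time, channels, height, width) video clip tensor."""
    b, t = frames.shape[:2]
    # Fold time into the batch so the image encoder sees single frames.
    per_frame = clip_image_encoder(frames.flatten(0, 1))  # (b*t, dim)
    per_frame = per_frame.view(b, t, -1)
    # Simple temporal average pooling yields one clip-level embedding.
    return per_frame.mean(dim=1)  # (b, dim)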
CORA: Adapting CLIP for open-vocabulary detection with region prompting and anchor pre-matching
Open-vocabulary detection (OVD) is an object detection task aiming at detecting objects
from novel categories beyond the base categories on which the detector is trained. Recent …
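To make the task definition concrete, the sketch below shows the underlying open-vocabulary principle: region features are scored against text embeddings of arbitrary category names, so novel classes require no detector retraining. It deliberately omits CORA's region prompting and anchor pre-matching; the function name, temperature, and shapes are assumptions.

import torch
import torch.nn.functional as F

def open_vocab_scores(region_features, text_features, temperature=0.01):
    """region_features: (R, dim) features of region proposals.
    text_features: (C, dim) text features of category-name prompts, where C
    may include categories never seen during detector training."""
    region_features = F.normalize(region_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    # Softmax over categories gives per-region class probabilities.
    return (region_features @ text_features.t() / temperature).softmax(dim=-1)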
Contextual object detection with multimodal large language models
Recent Multimodal Large Language Models (MLLMs) are remarkable in vision-
language tasks, such as image captioning and question answering, but lack the essential …