Vision-language models for vision tasks: A survey
Most visual recognition studies rely heavily on crowd-labelled data for training deep neural
networks (DNNs), and they usually train a DNN for each single visual recognition task …
Maple: Multi-modal prompt learning
Pre-trained vision-language (VL) models such as CLIP have shown excellent generalization
ability to downstream tasks. However, they are sensitive to the choice of input text prompts …
Visual-language prompt tuning with knowledge-guided context optimization
Prompt tuning is an effective way to adapt the pretrained visual-language model (VLM) to
the downstream task using task-related textual tokens. Representative CoOp-based works …
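For context, the CoOp-style mechanism this abstract refers to can be sketched as a handful of learnable context vectors prepended to each class-name embedding while the pretrained VLM itself stays frozen. A minimal PyTorch sketch, where the embedding size and token shapes are illustrative assumptions rather than details from the paper:

    import torch
    import torch.nn as nn

    class LearnableContext(nn.Module):
        """CoOp-style prompt tuning: M learnable context vectors shared by all classes."""
        def __init__(self, n_ctx: int = 16, embed_dim: int = 512):
            super().__init__()
            # The only trainable parameters; the pretrained VLM stays frozen.
            self.ctx = nn.Parameter(torch.randn(n_ctx, embed_dim) * 0.02)

        def forward(self, name_embeds: torch.Tensor) -> torch.Tensor:
            # name_embeds: (n_classes, n_name_tokens, embed_dim) class-name embeddings
            ctx = self.ctx.unsqueeze(0).expand(name_embeds.shape[0], -1, -1)
            # Build "[V]_1 ... [V]_M <class name>" prompts for the frozen text encoder.
            return torch.cat([ctx, name_embeds], dim=1)

The concatenated prompts are then passed through the frozen text encoder, and only the context vectors receive gradients.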
Self-regulating prompts: Foundational model adaptation without forgetting
Prompt learning has emerged as an efficient alternative to fine-tuning foundational models,
such as CLIP, for various downstream tasks. Conventionally trained using the task-specific …
Generalized out-of-distribution detection and beyond in vision language model era: A survey
Detecting out-of-distribution (OOD) samples is crucial for ensuring the safety of machine
learning systems and has shaped the field of OOD detection. Meanwhile, several other …
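As background for the survey's framing, the classical OOD-detection baseline is the maximum softmax probability (MSP) score: an input whose top softmax confidence falls below a threshold is flagged as OOD. A minimal sketch (the 0.5 threshold is an arbitrary illustrative choice):

    import torch
    import torch.nn.functional as F

    def msp_score(logits: torch.Tensor) -> torch.Tensor:
        # Maximum softmax probability per sample; lower values suggest OOD inputs.
        return F.softmax(logits, dim=-1).max(dim=-1).values

    logits = torch.randn(4, 10)          # a batch of 4 samples over 10 classes
    is_ood = msp_score(logits) < 0.5     # 0.5 is an illustrative threshold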
Prompt-aligned gradient for prompt tuning
Thanks to the large pre-trained vision-language models (VLMs) like CLIP, we can craft a
zero-shot classifier by discrete prompt design, e.g., the confidence score of an image …
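To make "discrete prompt design" concrete: a zero-shot CLIP classifier scores an image against hand-written prompts such as "a photo of a <class>", and the softmax over image-text similarities serves as the confidence score. A minimal sketch using OpenAI's clip package; the model name, file name, and class list are illustrative assumptions:

    import clip
    import torch
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    classes = ["cat", "dog", "car"]
    # Discrete prompt design: one hand-crafted template per class.
    text = clip.tokenize([f"a photo of a {c}" for c in classes]).to(device)
    image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

    with torch.no_grad():
        image_feat = model.encode_image(image)
        text_feat = model.encode_text(text)
        image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
        text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
        # Confidence score of the image under each class prompt.
        probs = (100.0 * image_feat @ text_feat.T).softmax(dim=-1)

    print(dict(zip(classes, probs[0].tolist())))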
Multimodality helps unimodality: Cross-modal few-shot learning with multimodal models
The ability to quickly learn a new task with minimal instruction, known as few-shot learning, is
a central aspect of intelligent agents. Classical few-shot benchmarks make use of few-shot …
Ddcot: Duty-distinct chain-of-thought prompting for multimodal reasoning in language models
A long-standing goal of AI systems is to perform complex multimodal reasoning like humans.
Recently, large language models (LLMs) have made remarkable strides in such multi-step …
Cheap and quick: Efficient vision-language instruction tuning for large language models
Recently, interest has grown in extending the multimodal capability of large
language models (LLMs), e.g., vision-language (VL) learning, which is regarded as the next …
Neural prompt search
The size of vision models has grown exponentially over the last few years, especially after
the emergence of the Vision Transformer. This has motivated the development of parameter …
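For reference, the parameter-efficient modules that such prompt/module search spaces typically range over include adapters, LoRA, and prompt tokens; a minimal bottleneck-adapter sketch (the 768/64 dimensions are illustrative assumptions, and this is a generic adapter, not the paper's searched design):

    import torch
    import torch.nn as nn

    class Adapter(nn.Module):
        """Bottleneck adapter: a small residual MLP inserted into a frozen ViT block."""
        def __init__(self, dim: int = 768, bottleneck: int = 64):
            super().__init__()
            self.down = nn.Linear(dim, bottleneck)
            self.up = nn.Linear(bottleneck, dim)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # The residual connection preserves the frozen backbone's features.
            return x + self.up(torch.relu(self.down(x)))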