CLIP in medical imaging: A comprehensive survey

Z Zhao, Y Liu, H Wu, M Wang, Y Li, S Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
Contrastive Language-Image Pre-training (CLIP), a simple yet effective pre-training
paradigm, successfully introduces text supervision to vision models. It has shown promising …
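
As a minimal sketch of the zero-shot recipe this line of work builds on: encode class names as text prompts and pick the class whose text embedding best matches the image embedding. The checkpoint below is the public transformers CLIP; the medical label set and the input file "scan.png" are illustrative assumptions, not from the survey:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical medical label set; any class names work the same way.
labels = ["chest X-ray", "brain MRI", "abdominal CT"]
texts = [f"a photo of a {label}" for label in labels]
image = Image.open("scan.png")  # assumed input image

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)
# Image-text similarities become class probabilities: zero-shot classification.
probs = out.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```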

Visual tuning

BXB Yu, J Chang, H Wang, L Liu, S Wang… - ACM Computing …, 2024 - dl.acm.org
Fine-tuning visual models has been widely shown to deliver promising performance on many
downstream visual tasks. With the rapid development of pre-trained visual foundation …
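
As a concrete instance of the paradigm this survey organizes, a PyTorch sketch contrasting linear probing (train only a new head) with full fine-tuning; the ResNet-50 backbone, the 10-class task, and all hyperparameters are assumptions for illustration:

```python
import torch
import torch.nn as nn
from torchvision import models

# Load an ImageNet-pre-trained backbone (ResNet-50 as an illustrative choice).
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# Freeze everything, then attach a fresh head for a hypothetical 10-class task.
for p in model.parameters():
    p.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, 10)  # new head trains from scratch

# Only the head's parameters reach the optimizer (linear probing);
# passing model.parameters() instead would give full fine-tuning.
optimizer = torch.optim.AdamW(model.fc.parameters(), lr=1e-3)

x = torch.randn(4, 3, 224, 224)   # stand-in batch
y = torch.randint(0, 10, (4,))
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
optimizer.step()
```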

A systematic survey of prompt engineering on vision-language foundation models

J Gu, Z Han, S Chen, A Beirami, B He, G Zhang… - arXiv preprint arXiv …, 2023 - arxiv.org
Prompt engineering is a technique that involves augmenting a large pre-trained model with
task-specific hints, known as prompts, to adapt the model to new tasks. Prompts can be …
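
A common hand-crafted instance of such prompts is the template ensemble: render each class name through several templates and average the normalized text embeddings into one class prototype. A sketch with the public transformers CLIP (the templates and class name are illustrative):

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

templates = ["a photo of a {}.", "a sketch of a {}.", "a close-up photo of a {}."]
classname = "dog"

# Embed every prompt variant and average: a simple prompt ensemble.
texts = [t.format(classname) for t in templates]
inputs = processor(text=texts, return_tensors="pt", padding=True)
with torch.no_grad():
    feats = model.get_text_features(**inputs)
feats = feats / feats.norm(dim=-1, keepdim=True)
class_embedding = feats.mean(dim=0)  # ensembled class prototype
```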

A pilot study of query-free adversarial attack against Stable Diffusion

H Zhuang, Y Zhang, S Liu - … of the IEEE/CVF Conference on …, 2023 - openaccess.thecvf.com
Despite the record-breaking performance in Text-to-Image (T2I) generation by Stable
Diffusion, little research attention has been paid to its adversarial robustness. In this work, we study …
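
The query-free setting can be sketched without touching the diffusion model at all: perturb the prompt so that its CLIP text embedding, which conditions generation, drifts from the clean one. The random suffix search below is a simplified stand-in for the paper's actual optimization, assuming only the standard transformers CLIP API:

```python
import random
import string
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(text):
    inputs = processor(text=[text], return_tensors="pt", padding=True)
    with torch.no_grad():
        f = model.get_text_features(**inputs)
    return f / f.norm(dim=-1, keepdim=True)

prompt = "a photo of a red sports car"  # illustrative prompt
target = embed(prompt)

# Random search over short suffixes: keep whichever perturbation pushes the
# text embedding furthest from the clean prompt, never querying the T2I model.
best_suffix, best_dist = "", 0.0
for _ in range(200):
    suffix = "".join(random.choices(string.ascii_lowercase, k=5))
    dist = 1 - (embed(prompt + " " + suffix) @ target.T).item()
    if dist > best_dist:
        best_suffix, best_dist = suffix, dist
print(best_suffix, best_dist)
```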

One prompt word is enough to boost adversarial robustness for pre-trained vision-language models

L Li, H Guan, J Qiu, M Spratling - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Large pre-trained Vision-Language Models (VLMs) like CLIP, despite having
remarkable generalization ability, are highly vulnerable to adversarial examples. This work …
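
The underlying idea, tuning only a text-side prompt on adversarially perturbed images while both encoders stay frozen, can be sketched with stand-in modules; every module, dimension, and hyperparameter below is an assumption for illustration, not the paper's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-ins for CLIP's frozen encoders (assumptions, not the real model).
img_enc = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64))
txt_prototypes = torch.randn(10, 64)    # frozen class-text features
prompt = nn.Parameter(torch.zeros(64))  # the only trainable tensor

opt = torch.optim.SGD([prompt], lr=0.1)
for p in img_enc.parameters():
    p.requires_grad_(False)

def logits(x):
    # Shift every class-text feature by the shared learnable prompt.
    t = F.normalize(txt_prototypes + prompt, dim=-1)
    v = F.normalize(img_enc(x), dim=-1)
    return 100.0 * v @ t.T

x = torch.rand(8, 3, 32, 32)
y = torch.randint(0, 10, (8,))

# Inner loop: PGD crafts adversarial images against the current prompt.
delta = torch.zeros_like(x, requires_grad=True)
for _ in range(3):
    loss = F.cross_entropy(logits(x + delta), y)
    g, = torch.autograd.grad(loss, delta)
    delta = (delta + (2 / 255) * g.sign()).clamp(-8 / 255, 8 / 255)
    delta = delta.detach().requires_grad_(True)

# Outer step: only the prompt is updated to resist those examples.
opt.zero_grad()
F.cross_entropy(logits(x + delta), y).backward()
opt.step()
```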

Robust CLIP: Unsupervised adversarial fine-tuning of vision embeddings for robust large vision-language models

C Schlarmann, ND Singh, F Croce, M Hein - arXiv preprint arXiv …, 2024 - arxiv.org
Multi-modal foundation models like OpenFlamingo, LLaVA, and GPT-4 are increasingly
used for various real-world tasks. Prior work has shown that these models are highly …
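
The unsupervised scheme can be sketched as two coupled steps: craft perturbations that push the vision encoder's embeddings away from a frozen copy's clean embeddings, then fine-tune the encoder to pull them back, with no labels involved. The tiny encoder and all hyperparameters below are stand-in assumptions:

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in vision encoder (an assumption; the paper fine-tunes CLIP's ViT).
enc = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64))
frozen = copy.deepcopy(enc).requires_grad_(False)  # reference clean embeddings
opt = torch.optim.AdamW(enc.parameters(), lr=1e-5)

x = torch.rand(8, 3, 32, 32)  # unlabeled images: the scheme needs no labels

# PGD maximizes the embedding distance to the frozen reference...
delta = torch.zeros_like(x, requires_grad=True)
for _ in range(5):
    d = F.mse_loss(enc(x + delta), frozen(x))
    g, = torch.autograd.grad(d, delta)
    delta = (delta + (1 / 255) * g.sign()).clamp(-4 / 255, 4 / 255)
    delta = delta.detach().requires_grad_(True)

# ...and the encoder is updated to pull adversarial embeddings back.
opt.zero_grad()
F.mse_loss(enc(x + delta), frozen(x)).backward()
opt.step()
```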

Towards calibrated robust fine-tuning of vision-language models

C Oh, H Lim, M Kim, D Han, S Yun… - Advances in …, 2025 - proceedings.neurips.cc
Improving out-of-distribution (OOD) generalization during in-distribution (ID) adaptation is a
primary goal of robust fine-tuning of zero-shot models beyond naive fine-tuning. However …

Not all prompts are secure: A switchable backdoor attack against pre-trained vision transformers

S Yang, J Bai, K Gao, Y Yang, Y Li… - Proceedings of the …, 2024 - openaccess.thecvf.com
Given the power of vision transformers, a new learning paradigm, pre-training and then
prompting, makes it more efficient and effective to address downstream visual recognition …

ImageNet-D: Benchmarking neural network robustness on diffusion synthetic object

C Zhang, F Pan, J Kim, IS Kweon… - Proceedings of the …, 2024 - openaccess.thecvf.com
We establish rigorous benchmarks for visual perception robustness. Synthetic images such
as ImageNet-C, ImageNet-9, and Stylized-ImageNet provide a specific type of evaluation over …
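
Evaluating a model on such a benchmark reduces to a standard accuracy loop over the synthetic images. The sketch below assumes a hypothetical local copy in ImageFolder layout under "imagenet-d/" and that the folder ordering matches the model's class indices, which real benchmarks handle with an explicit mapping:

```python
import torch
from torchvision import datasets, models, transforms

# Standard ImageNet-style preprocessing for the pre-trained classifier.
tf = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
# NOTE: assumes folder class order matches the model's output indices.
data = datasets.ImageFolder("imagenet-d/", transform=tf)
loader = torch.utils.data.DataLoader(data, batch_size=64)

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2).eval()
correct = total = 0
with torch.no_grad():
    for x, y in loader:
        correct += (model(x).argmax(dim=1) == y).sum().item()
        total += y.numel()
print(f"robust accuracy: {correct / total:.3f}")
```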

Pre-trained model guided fine-tuning for zero-shot adversarial robustness

S Wang, J Zhang, Z Yuan… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Large-scale pre-trained vision-language models like CLIP have demonstrated impressive
performance across various tasks and exhibit remarkable zero-shot generalization capability …