CleanCLIP: Mitigating data poisoning attacks in multimodal contrastive learning

H Bansal, N Singhi, Y Yang, F Yin… - Proceedings of the …, 2023 - openaccess.thecvf.com
Multimodal contrastive pretraining has been used to train multimodal representation models,
such as CLIP, on large amounts of paired image-text data. However, previous studies have …

Spurious correlations in machine learning: A survey

W Ye, G Zheng, X Cao, Y Ma, A Zhang - arXiv preprint arXiv:2402.12715, 2024 - arxiv.org
Machine learning systems are known to be sensitive to spurious correlations between non-
essential features of the inputs (e.g., background, texture, and secondary objects) and the …

Robust learning with progressive data expansion against spurious correlation

Y Deng, Y Yang, B Mirzasoleiman… - Advances in Neural …, 2023 - proceedings.neurips.cc
While deep learning models have shown remarkable performance in various tasks, they are
susceptible to learning non-generalizable _spurious features_ rather than the core features …

Distilling vision-language models on millions of videos

Y Zhao, L Zhao, X Zhou, J Wu… - Proceedings of the …, 2024 - openaccess.thecvf.com
The recent advance in vision-language models is largely attributed to the abundance of
image-text data. We aim to replicate this success for video-language models but there …

Sieve: Multimodal dataset pruning using image captioning models

A Mahmoud, M Elhoushi, A Abbas… - Proceedings of the …, 2024 - openaccess.thecvf.com
Vision-Language Models (VLMs) are pretrained on large, diverse, and noisy web-
crawled datasets. This underscores the critical need for dataset pruning as the quality of …

Calibrating multi-modal representations: A pursuit of group robustness without annotations

C You, Y Min, W Dai, JS Sekhon… - 2024 IEEE/CVF …, 2024 - ieeexplore.ieee.org
Fine-tuning pre-trained vision-language models, like CLIP, has yielded success on diverse
downstream tasks. However, several pain points persist for this paradigm: (i) directly tuning …

A Sober Look at the Robustness of CLIPs to Spurious Features

Q Wang, Y Lin, Y Chen, L Schmidt… - Advances in Neural …, 2025 - proceedings.neurips.cc
Large vision-language models, such as CLIP, demonstrate more impressive robustness to
spurious features than single-modal models trained on ImageNet. However, existing test …

FD-Align: Feature discrimination alignment for fine-tuning pre-trained models in few-shot learning

K Song, H Ma, B Zou, H Zhang… - Advances in Neural …, 2023 - proceedings.neurips.cc
Due to the limited availability of data, existing few-shot learning methods trained from
scratch fail to achieve satisfactory performance. In contrast, large-scale pre-trained models …

Prompting is a double-edged sword: improving worst-group robustness of foundation models

A Setlur, S Garg, V Smith, S Levine - Forty-first International …, 2024 - openreview.net
Machine learning models fail catastrophically under distribution shift, but a surprisingly
effective way to empirically improve robustness to some types of shift (e.g., ImageNet-A/C) …

Zero-shot robustification of zero-shot models

D Adila, C Shin, L Cai, F Sala - arXiv preprint arXiv:2309.04344, 2023 - arxiv.org
Zero-shot inference is a powerful paradigm that enables the use of large pretrained models
for downstream classification tasks without further training. However, these models are …