ProbVLM: Probabilistic adapter for frozen vision-language models

U Upadhyay, S Karthik, M Mancini… - Proceedings of the …, 2023 - openaccess.thecvf.com
Large-scale vision-language models (VLMs) like CLIP successfully find correspondences
between images and text. Through the standard deterministic mapping process, an image or …

Improved probabilistic image-text representations

S Chun - arXiv preprint arXiv:2305.18171, 2023 - arxiv.org
Image-Text Matching (ITM) task, a fundamental vision-language (VL) task, suffers from the
inherent ambiguity arising from multiplicity and imperfect annotations. Deterministic …

Seeing what you miss: Vision-language pre-training with semantic completion learning

Y Ji, R Tu, J Jiang, W Kong, C Cai… - Proceedings of the …, 2023 - openaccess.thecvf.com
Cross-modal alignment is essential for vision-language pre-training (VLP) models to learn
the correct corresponding information across different modalities. For this purpose, inspired …

Open-set recognition in the age of vision-language models

D Miller, N Sünderhauf, A Kenna, K Mason - European Conference on …, 2024 - Springer
Are vision-language models (VLMs) for open-vocabulary perception inherently open-set
models because they are trained on internet-scale datasets? We answer this question with a …

SA-Attack: Improving adversarial transferability of vision-language pre-training models via self-augmentation

B He, X Jia, S Liang, T Lou, Y Liu, X Cao - arXiv preprint arXiv:2312.04913, 2023 - arxiv.org
Current Visual-Language Pre-training (VLP) models are vulnerable to adversarial examples.
These adversarial examples present substantial security risks to VLP models, as they can …

LLM-enhanced Composed Image Retrieval: An Intent Uncertainty-aware Linguistic-Visual Dual Channel Matching Model

H Ge, Y Jiang, J Sun, K Yuan, Y Liu - ACM Transactions on Information …, 2025 - dl.acm.org
Composed image retrieval (CoIR) involves a multi-modal query of the reference image and
modification text describing the desired changes, allowing users to express image retrieval …

TiMix: Text-Aware Image Mixing for Effective Vision-Language Pre-training

C Jiang, W Ye, H Xu, Q Ye, M Yan, J Zhang… - Proceedings of the …, 2024 - ojs.aaai.org
Self-supervised Multi-modal Contrastive Learning (SMCL) remarkably advances
modern Vision-Language Pre-training (VLP) models by aligning visual and linguistic …

Embracing Unimodal Aleatoric Uncertainty for Robust Multimodal Fusion

Z Gao, X Jiang, X Xu, F Shen, Y Li… - Proceedings of the …, 2024 - openaccess.thecvf.com
As a fundamental problem in multimodal learning, multimodal fusion aims to compensate for
the inherent limitations of a single modality. One challenge of multimodal fusion is that the …

Probabilistic Language-Image Pre-Training

S Chun, W Kim, S Park, S Yun - arXiv preprint arXiv:2410.18857, 2024 - arxiv.org
Vision-language models (VLMs) embed aligned image-text pairs into a joint space but often
rely on deterministic embeddings, assuming a one-to-one correspondence between images …

ExACT: Language-guided Conceptual Reasoning and Uncertainty Estimation for Event-based Action Recognition and More

J Zhou, X Zheng, Y Lyu, L Wang - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Event cameras have recently been shown beneficial for practical vision tasks such as action
recognition thanks to their high temporal resolution, power efficiency, and reduced privacy …