ProbVLM: Probabilistic adapter for frozen vision-language models
Large-scale vision-language models (VLMs) like CLIP successfully find correspondences
between images and text. Through the standard deterministic mapping process, an image or …
Improved probabilistic image-text representations
S Chun - arXiv preprint arXiv:2305.18171, 2023 - arxiv.org
Image-Text Matching (ITM) task, a fundamental vision-language (VL) task, suffers from the
inherent ambiguity arising from multiplicity and imperfect annotations. Deterministic …
Seeing what you miss: Vision-language pre-training with semantic completion learning
Cross-modal alignment is essential for vision-language pre-training (VLP) models to learn
the correct corresponding information across different modalities. For this purpose, inspired …
Open-set recognition in the age of vision-language models
Are vision-language models (VLMs) for open-vocabulary perception inherently open-set
models because they are trained on internet-scale datasets? We answer this question with a …
SA-Attack: Improving adversarial transferability of vision-language pre-training models via self-augmentation
Current Visual-Language Pre-training (VLP) models are vulnerable to adversarial examples.
These adversarial examples present substantial security risks to VLP models, as they can …
LLM-enhanced Composed Image Retrieval: An Intent Uncertainty-aware Linguistic-Visual Dual Channel Matching Model
Composed image retrieval (CoIR) involves a multi-modal query of the reference image and
modification text describing the desired changes, allowing users to express image retrieval …
TiMix: Text-Aware Image Mixing for Effective Vision-Language Pre-training
Self-supervised Multi-modal Contrastive Learning (SMCL) remarkably advances
modern Vision-Language Pre-training (VLP) models by aligning visual and linguistic …
Embracing Unimodal Aleatoric Uncertainty for Robust Multimodal Fusion
As a fundamental problem in multimodal learning, multimodal fusion aims to compensate for
the inherent limitations of a single modality. One challenge of multimodal fusion is that the …
Probabilistic Language-Image Pre-Training
Vision-language models (VLMs) embed aligned image-text pairs into a joint space but often
rely on deterministic embeddings, assuming a one-to-one correspondence between images …
ExACT: Language-guided Conceptual Reasoning and Uncertainty Estimation for Event-based Action Recognition and More
Event cameras have recently been shown beneficial for practical vision tasks such as action
recognition thanks to their high temporal resolution, power efficiency, and reduced privacy …