ProbVLM: Probabilistic adapter for frozen vision-language models

U Upadhyay, S Karthik, M Mancini… - Proceedings of the …, 2023 - openaccess.thecvf.com
Large-scale vision-language models (VLMs) like CLIP successfully find correspondences
between images and text. Through the standard deterministic mapping process, an image or …

Improved probabilistic image-text representations

S Chun - arXiv preprint arXiv:2305.18171, 2023 - arxiv.org
Image-Text Matching (ITM) task, a fundamental vision-language (VL) task, suffers from the
inherent ambiguity arising from multiplicity and imperfect annotations. Deterministic …

Seeing what you miss: Vision-language pre-training with semantic completion learning

Y Ji, R Tu, J Jiang, W Kong, C Cai… - Proceedings of the …, 2023 - openaccess.thecvf.com
Cross-modal alignment is essential for vision-language pre-training (VLP) models to learn
the correct corresponding information across different modalities. For this purpose, inspired …

Open-set recognition in the age of vision-language models

D Miller, N Sünderhauf, A Kenna, K Mason - European Conference on …, 2024 - Springer
Are vision-language models (VLMs) for open-vocabulary perception inherently open-set
models because they are trained on internet-scale datasets? We answer this question with a …

SA-Attack: Improving adversarial transferability of vision-language pre-training models via self-augmentation

B He, X Jia, S Liang, T Lou, Y Liu, X Cao - arXiv preprint arXiv:2312.04913, 2023 - arxiv.org
Current Visual-Language Pre-training (VLP) models are vulnerable to adversarial examples.
These adversarial examples present substantial security risks to VLP models, as they can …

LLM-enhanced Composed Image Retrieval: An Intent Uncertainty-aware Linguistic-Visual Dual Channel Matching Model

H Ge, Y Jiang, J Sun, K Yuan, Y Liu - ACM Transactions on Information …, 2025 - dl.acm.org
Composed image retrieval (CoIR) involves a multi-modal query of the reference image and
modification text describing the desired changes, allowing users to express image retrieval …

TiMix: Text-Aware Image Mixing for Effective Vision-Language Pre-training

C Jiang, W Ye, H Xu, Q Ye, M Yan, J Zhang… - Proceedings of the …, 2024 - ojs.aaai.org
Self-supervised Multi-modal Contrastive Learning (SMCL) remarkably advances
modern Vision-Language Pre-training (VLP) models by aligning visual and linguistic …

Embracing Unimodal Aleatoric Uncertainty for Robust Multimodal Fusion

Z Gao, X Jiang, X Xu, F Shen, Y Li… - Proceedings of the …, 2024 - openaccess.thecvf.com
As a fundamental problem in multimodal learning, multimodal fusion aims to compensate for
the inherent limitations of a single modality. One challenge of multimodal fusion is that the …

Probabilistic Language-Image Pre-Training

S Chun, W Kim, S Park, S Yun - arXiv preprint arXiv:2410.18857, 2024 - arxiv.org
Vision-language models (VLMs) embed aligned image-text pairs into a joint space but often
rely on deterministic embeddings, assuming a one-to-one correspondence between images …

ExACT: Language-guided Conceptual Reasoning and Uncertainty Estimation for Event-based Action Recognition and More

J Zhou, X Zheng, Y Lyu, L Wang - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Event cameras have recently been shown beneficial for practical vision tasks such as action
recognition thanks to their high temporal resolution, power efficiency, and reduced privacy …