COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training

S Kim, R Xiao, MI Georgescu, S Alaniz… - arXiv preprint arXiv …, 2024 - arxiv.org
Vision-Language Models (VLMs) trained with contrastive loss have achieved significant
advancements in various vision and language tasks. However, the global nature of …