Knowledge-enhanced dual-stream zero-shot composed image retrieval
We study the zero-shot Composed Image Retrieval (ZS-CIR) task, which aims to retrieve the
target image given a reference image and a description without training on the triplet …
Towards flexible perception with visual memory
Training a neural network is a monolithic endeavor, akin to carving knowledge into stone:
once the process is completed, editing the knowledge in a network is nearly impossible …
Anchor-based Robust Finetuning of Vision-Language Models
We aim at finetuning a vision-language model without hurting its out-of-distribution (OOD)
generalization. We address two types of OOD generalization, i.e., i) domain shift such as …
Context-aware multimodal pretraining
Large-scale multimodal representation learning successfully optimizes for zero-shot transfer
at test time. Yet the standard pretraining paradigm (contrastive learning on large amounts of …