- Academic Search

Knowledge-enhanced dual-stream zero-shot composed image retrieval

Y Suo, F Ma, L Zhu, Y Yang - Proceedings of the IEEE/CVF …, 2024 - openaccess.thecvf.com

We study the zero-shot Composed Image Retrieval (ZS-CIR) task which is to retrieve the
target image given a reference image and a description without training on the triplet …

Cited by 13 (3 versions)

[PDF] arxiv.org

Grounding language models for visual entity recognition

Z …

… a given query image to one of the 6 million existing entities in Wikipedia. One way of …

Cited by 5 (3 versions)

[PDF] arxiv.org

Towards flexible perception with visual memory

R Geirhos, P Jaini, A Stone, S Medapati, X Yi… - arXiv preprint arXiv …, 2024 - arxiv.org

Training a neural network is a monolithic endeavor, akin to carving knowledge into stone:
once the process is completed, editing the knowledge in a network is nearly impossible …

Cited by 2

[PDF] thecvf.com

Anchor-based Robust Finetuning of Vision-Language Models

J Han, Z Lin, Z Sun, Y Gao, K Yan… - Proceedings of the …, 2024 - openaccess.thecvf.com

We aim at finetuning a vision-language model without hurting its out-of-distribution (OOD)
generalization. We address two types of OOD generalization, i.e., i) domain shift such as …

Cited by 6 (3 versions)

[PDF] arxiv.org

Context-aware multimodal pretraining

K Roth, Z Akata, D Damen, I Balažević… - arXiv preprint arXiv …, 2024 - arxiv.org

Large-scale multimodal representation learning successfully optimizes for zero-shot transfer
at test time. Yet the standard pretraining paradigm (contrastive learning on large amounts of …

Cited by 1

Saved in My library

Retrieval-enhanced contrastive vision-text models

Knowledge-enhanced dual-stream zero-shot composed image retrieval

Grounding language models for visual entity recognition

Towards flexible perception with visual memory

Anchor-based Robust Finetuning of Vision-Language Models

Context-aware multimodal pretraining