Measuring progress in fine-grained vision-and-language understanding

E Bugliarello, L Sartran, A Agrawal… - arXiv preprint arXiv …, 2023 - arxiv.org
While pretraining on large-scale image-text data from the Web has facilitated rapid progress
on many vision-and-language (V&L) tasks, recent work has demonstrated that pretrained …

Grounded Image Text Matching with Mismatched Relation Reasoning

Y Wu, Y Wei, H Wang, Y Liu… - Proceedings of the …, 2023 - openaccess.thecvf.com
Abstract This paper introduces Grounded Image Text Matching with Mismatched Relation
(GITM-MR), a novel visual-linguistic joint task that evaluates the relation understanding …

Investigating Compositional Challenges in Vision-Language Models for Visual Grounding

Y Zeng, Y Huang, J Zhang, Z Jie… - Proceedings of the …, 2024 - openaccess.thecvf.com
Pre-trained vision-language models (VLMs) have achieved high performance on various
downstream tasks and have been widely used for visual grounding tasks in a weakly …

Towards an Exhaustive Evaluation of Vision-Language Foundation Models

E Salin, S Ayache, B Favre - Proceedings of the IEEE/CVF …, 2023 - openaccess.thecvf.com
Vision-language foundation models have seen a considerable increase in performance in the
last few years. However, there is still a lack of comprehensive evaluation methods able to …

Weakly-supervised learning of visual relations in multimodal pretraining

E Bugliarello, A Nematzadeh, LA Hendricks - arXiv preprint arXiv …, 2023 - arxiv.org
Recent work in vision-and-language pretraining has investigated supervised signals from
object detection data to learn better, fine-grained multimodal representations. In this work …

Analyzing the Robustness of Vision & Language Models

A Shirnin, N Andreev, S Potapova… - IEEE/ACM Transactions on …, 2024 - ieeexplore.ieee.org
We present an approach to evaluate the robustness of pre-trained vision and language
(V&L) models to noise in input data. Given a source image/text, we perturb it using standard …

Semantic composition in visually grounded language models

R Pandey - arXiv preprint arXiv:2305.16328, 2023 - arxiv.org
What is sentence meaning and its ideal representation? Much of the expressive power of
human language derives from semantic composition, the mind's ability to represent meaning …

MASS: Overcoming Language Bias in Image-Text Matching

J Chung, S Lim, S Lee, Y Yu - arXiv preprint arXiv:2501.11469, 2025 - arxiv.org
Pretrained vision-language models have made significant advancements in multimodal
tasks, including image-text retrieval. However, a major challenge in image-text matching lies …

Extract Free Dense Misalignment from CLIP

JY Nam, J Im, W Kim, T Kil - arXiv preprint arXiv:2412.18404, 2024 - arxiv.org
Recent vision-language foundation models still frequently produce outputs misaligned with
their inputs, evidenced by object hallucination in captioning and prompt misalignment in the …

From Pixels to Explanations: Uncovering the Reasoning Process in Visual Question Answering

S Zhang, J Liu, Z Wei - Proceedings of the 5th ACM International …, 2023 - dl.acm.org
Visual reasoning requires models to construct a reasoning process towards the final
decision. Previous studies have used attention maps or textual explanations to illustrate the …