Foundation Models Defining a New Era in Vision: a Survey and Outlook

M Awais, M Naseer, S Khan, RM Anwer… - … on Pattern Analysis …, 2025 - ieeexplore.ieee.org
Vision systems that see and reason about the compositional nature of visual scenes are
fundamental to understanding our world. The complex relations between objects and their …

DOCCI: Descriptions of connected and contrasting images

Y Onoe, S Rane, Z Berger, Y Bitton, J Cho… - … on Computer Vision, 2024 - Springer
Vision-language datasets are vital for both text-to-image (T2I) and image-to-text (I2T)
research. However, current datasets lack descriptions with fine-grained detail that would …

A survey on Segment Anything Model (SAM): Vision foundation model meets prompt engineering

C Zhang, FD Puspitasari, S Zheng, C Li, Y Qiao… - arXiv preprint arXiv …, 2023 - arxiv.org
The Segment Anything Model (SAM), developed by Meta AI Research, has recently attracted
significant attention. Trained on a large segmentation dataset of over 1 billion masks, SAM is …
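
The snippet highlights SAM's promptable design. As a minimal sketch of what a point prompt looks like in practice with Meta's segment-anything package (the checkpoint and image paths are placeholders):

    import numpy as np
    from PIL import Image
    from segment_anything import sam_model_registry, SamPredictor

    # Load a pretrained SAM backbone; the checkpoint path is a placeholder.
    sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
    predictor = SamPredictor(sam)

    # Embed the image once, then reuse the embedding for any number of prompts.
    image = np.array(Image.open("scene.jpg").convert("RGB"))
    predictor.set_image(image)

    # A single foreground click is the entire "prompt".
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[320, 240]]),  # (x, y) pixel coordinates
        point_labels=np.array([1]),           # 1 = foreground, 0 = background
        multimask_output=True,                # return several candidate masks
    )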

MoAI: Mixture of all intelligence for large language and vision models

BK Lee, B Park, CW Kim, YM Ro - European Conference on …, 2024 - Springer
The rise of large language models (LLMs) and instruction tuning has led to the current trend
of instruction-tuned large language and vision models (LLVMs). This trend involves either …

Zero-shot referring expression comprehension via structural similarity between images and captions

Z Han, F Zhu, Q Lao, H Jiang - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Zero-shot referring expression comprehension aims at localizing bounding boxes in an
image corresponding to provided textual prompts, which requires: (i) a fine-grained …
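
The snippet only states the task; the paper's structural-similarity method is not detailed here. As a generic illustration of the zero-shot setup it builds on, a crop-and-score baseline with OpenAI's clip package (image path and candidate boxes are hypothetical; this is not the authors' approach):

    import torch, clip
    from PIL import Image

    model, preprocess = clip.load("ViT-B/32", device="cpu")

    image = Image.open("scene.jpg")                      # placeholder image
    boxes = [(10, 20, 200, 180), (220, 40, 400, 300)]    # hypothetical proposals (x1, y1, x2, y2)
    expression = "the dog on the left"

    crops = torch.stack([preprocess(image.crop(b)) for b in boxes])
    text = clip.tokenize([expression])

    with torch.no_grad():
        img_feat = model.encode_image(crops)
        txt_feat = model.encode_text(text)
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
        scores = (img_feat @ txt_feat.T).squeeze(-1)     # one similarity score per box

    best_box = boxes[scores.argmax().item()]             # predicted referent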

TOMGPT: Reliable text-only training approach for cost-effective multi-modal large language model

Y Chen, Q Wang, S Wu, Y Gao, T Xu, Y Hu - ACM Transactions on …, 2024 - dl.acm.org
Multi-modal large language models (MLLMs), such as GPT-4, exhibit strong comprehension
of human instructions, as well as zero-shot ability on new downstream multi …

Investigating compositional challenges in vision-language models for visual grounding

Y Zeng, Y Huang, J Zhang, Z Jie… - Proceedings of the …, 2024 - openaccess.thecvf.com
Pre-trained vision-language models (VLMs) have achieved high performance on various
downstream tasks and have been widely used for visual grounding tasks in a weakly …

Building vision-language models on solid foundations with masked distillation

S Sameni, K Kafle, H Tan… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Recent advancements in Vision-Language Models (VLMs) have marked a
significant leap in bridging the gap between computer vision and natural language …

TripletCLIP: Improving compositional reasoning of CLIP via synthetic vision-language negatives

M Patel, NSA Kusumba, S Cheng… - Advances in …, 2025 - proceedings.neurips.cc
Contrastive Language-Image Pretraining (CLIP) models maximize the mutual
information between text and visual modalities to learn representations. This makes the …
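
The "mutual information" phrasing refers to CLIP's symmetric contrastive (InfoNCE) objective over an image-text similarity matrix. A minimal sketch assuming precomputed embeddings (function name and temperature value are illustrative):

    import torch
    import torch.nn.functional as F

    def clip_contrastive_loss(image_feats, text_feats, temperature=0.07):
        # image_feats, text_feats: (N, D) embeddings of N matched image-text pairs
        image_feats = F.normalize(image_feats, dim=-1)
        text_feats = F.normalize(text_feats, dim=-1)
        logits = image_feats @ text_feats.T / temperature   # (N, N) similarity matrix
        targets = torch.arange(logits.size(0))              # i-th image matches i-th text
        loss_i2t = F.cross_entropy(logits, targets)         # image -> text direction
        loss_t2i = F.cross_entropy(logits.T, targets)       # text -> image direction
        return (loss_i2t + loss_t2i) / 2

TripletCLIP's own contribution, synthetic hard negatives, would enter as extra negative rows and columns in that logits matrix; the sketch above covers only the standard CLIP loss it starts from.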

Revisiting the role of language priors in vision-language models

Z Lin, X Chen, D Pathak, P Zhang… - arXiv preprint arXiv …, 2023 - arxiv.org
Vision-language models (VLMs) are impactful in part because they can be applied to a
variety of visual understanding tasks in a zero-shot fashion, without any fine-tuning. We …
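
"Zero-shot fashion, without any fine-tuning" here typically means scoring an image against a set of label prompts. A minimal sketch with OpenAI's clip package (labels and image path are illustrative):

    import torch, clip
    from PIL import Image

    model, preprocess = clip.load("ViT-B/32", device="cpu")

    labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]  # illustrative prompts
    image = preprocess(Image.open("example.jpg")).unsqueeze(0)             # placeholder image
    text = clip.tokenize(labels)

    with torch.no_grad():
        logits_per_image, _ = model(image, text)   # image-text similarity logits
        probs = logits_per_image.softmax(dim=-1)

    print(labels[probs.argmax().item()])           # highest-scoring label, no fine-tuning involved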