Foundation Models Defining a New Era in Vision: a Survey and Outlook

M Awais, M Naseer, S Khan, RM Anwer… - … on Pattern Analysis …, 2025 - ieeexplore.ieee.org
Vision systems that see and reason about the compositional nature of visual scenes are
fundamental to understanding our world. The complex relations between objects and their …

DOCCI: Descriptions of connected and contrasting images

Y Onoe, S Rane, Z Berger, Y Bitton, J Cho… - … on Computer Vision, 2024 - Springer
Vision-language datasets are vital for both text-to-image (T2I) and image-to-text (I2T)
research. However, current datasets lack descriptions with fine-grained detail that would …

A survey on Segment Anything Model (SAM): Vision foundation model meets prompt engineering

C Zhang, FD Puspitasari, S Zheng, C Li, Y Qiao… - arXiv preprint arXiv …, 2023 - arxiv.org
The Segment Anything Model (SAM), developed by Meta AI Research, has recently attracted
significant attention. Trained on a large segmentation dataset of over 1 billion masks, SAM is …
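
The snippet highlights SAM's promptable design. As a minimal sketch of what a point prompt looks like in practice with Meta's segment-anything package (the checkpoint and image paths are placeholders):

    import numpy as np
    from PIL import Image
    from segment_anything import sam_model_registry, SamPredictor

    # Load a pretrained SAM backbone; the checkpoint path is a placeholder.
    sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
    predictor = SamPredictor(sam)

    # Embed the image once, then reuse the embedding for any number of prompts.
    image = np.array(Image.open("scene.jpg").convert("RGB"))
    predictor.set_image(image)

    # A single foreground click is the entire "prompt".
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[320, 240]]),  # (x, y) pixel coordinates
        point_labels=np.array([1]),           # 1 = foreground, 0 = background
        multimask_output=True,                # return several candidate masks
    )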

MoAI: Mixture of all intelligence for large language and vision models

BK Lee, B Park, CW Kim, YM Ro - European Conference on …, 2024 - Springer
The rise of large language models (LLMs) and instruction tuning has led to the current trend
of instruction-tuned large language and vision models (LLVMs). This trend involves either …

Zero-shot referring expression comprehension via structural similarity between images and captions

Z Han, F Zhu, Q Lao, H Jiang - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Zero-shot referring expression comprehension aims at localizing bounding boxes in an
image corresponding to provided textual prompts, which requires: (i) a fine-grained …
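
The snippet only states the task; the paper's structural-similarity method is not detailed here. As a generic illustration of the zero-shot setup it builds on, a crop-and-score baseline with OpenAI's clip package (image path and candidate boxes are hypothetical; this is not the authors' approach):

    import torch, clip
    from PIL import Image

    model, preprocess = clip.load("ViT-B/32", device="cpu")

    image = Image.open("scene.jpg")                      # placeholder image
    boxes = [(10, 20, 200, 180), (220, 40, 400, 300)]    # hypothetical proposals (x1, y1, x2, y2)
    expression = "the dog on the left"

    crops = torch.stack([preprocess(image.crop(b)) for b in boxes])
    text = clip.tokenize([expression])

    with torch.no_grad():
        img_feat = model.encode_image(crops)
        txt_feat = model.encode_text(text)
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
        scores = (img_feat @ txt_feat.T).squeeze(-1)     # one similarity score per box

    best_box = boxes[scores.argmax().item()]             # predicted referent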

TOMGPT: Reliable text-only training approach for cost-effective multi-modal large language model

Y Chen, Q Wang, S Wu, Y Gao, T Xu, Y Hu - ACM Transactions on …, 2024 - dl.acm.org
Multi-modal large language models (MLLMs), such as GPT-4, exhibit strong comprehension
of human instructions, as well as zero-shot ability on new downstream multi …

Investigating compositional challenges in vision-language models for visual grounding

Y Zeng, Y Huang, J Zhang, Z Jie… - Proceedings of the …, 2024 - openaccess.thecvf.com
Pre-trained vision-language models (VLMs) have achieved high performance on various
downstream tasks and have been widely used for visual grounding tasks in a weakly …

Building vision-language models on solid foundations with masked distillation

S Sameni, K Kafle, H Tan… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Recent advancements in Vision-Language Models (VLMs) have marked a
significant leap in bridging the gap between computer vision and natural language …

TripletCLIP: Improving compositional reasoning of CLIP via synthetic vision-language negatives

M Patel, NSA Kusumba, S Cheng… - Advances in …, 2025 - proceedings.neurips.cc
Contrastive Language-Image Pretraining (CLIP) models maximize the mutual
information between text and visual modalities to learn representations. This makes the …
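
The "mutual information" phrasing refers to CLIP's symmetric contrastive (InfoNCE) objective over an image-text similarity matrix. A minimal sketch assuming precomputed embeddings (function name and temperature value are illustrative):

    import torch
    import torch.nn.functional as F

    def clip_contrastive_loss(image_feats, text_feats, temperature=0.07):
        # image_feats, text_feats: (N, D) embeddings of N matched image-text pairs
        image_feats = F.normalize(image_feats, dim=-1)
        text_feats = F.normalize(text_feats, dim=-1)
        logits = image_feats @ text_feats.T / temperature   # (N, N) similarity matrix
        targets = torch.arange(logits.size(0))              # i-th image matches i-th text
        loss_i2t = F.cross_entropy(logits, targets)         # image -> text direction
        loss_t2i = F.cross_entropy(logits.T, targets)       # text -> image direction
        return (loss_i2t + loss_t2i) / 2

TripletCLIP's own contribution, synthetic hard negatives, would enter as extra negative rows and columns in that logits matrix; the sketch above covers only the standard CLIP loss it starts from.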

Revisiting the role of language priors in vision-language models

Z Lin, X Chen, D Pathak, P Zhang… - arXiv preprint arXiv …, 2023 - arxiv.org
Vision-language models (VLMs) are impactful in part because they can be applied to a
variety of visual understanding tasks in a zero-shot fashion, without any fine-tuning. We …
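
"Zero-shot fashion, without any fine-tuning" here typically means scoring an image against a set of label prompts. A minimal sketch with OpenAI's clip package (labels and image path are illustrative):

    import torch, clip
    from PIL import Image

    model, preprocess = clip.load("ViT-B/32", device="cpu")

    labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]  # illustrative prompts
    image = preprocess(Image.open("example.jpg")).unsqueeze(0)             # placeholder image
    text = clip.tokenize(labels)

    with torch.no_grad():
        logits_per_image, _ = model(image, text)   # image-text similarity logits
        probs = logits_per_image.softmax(dim=-1)

    print(labels[probs.argmax().item()])           # highest-scoring label, no fine-tuning involved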