Foundation Models Defining a New Era in Vision: A Survey and Outlook

M Awais, M Naseer, S Khan, RM Anwer… - … on Pattern Analysis …, 2025 - ieeexplore.ieee.org
Vision systems that see and reason about the compositional nature of visual scenes are
fundamental to understanding our world. The complex relations between objects and their …

Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

CogVLM: Visual expert for pretrained language models

W Wang, Q Lv, W Yu, W Hong, J Qi… - Advances in …, 2025 - proceedings.neurips.cc
We introduce CogVLM, a powerful open-source visual language foundation model. Different
from the popular shallow alignment method which maps image features into the …
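The snippet above contrasts CogVLM's design with the common shallow-alignment approach, in which image features are simply projected into the language model's input space. The sketch below illustrates that baseline idea (the approach being contrasted against, not CogVLM itself): pre-extracted visual features are passed through a single learned projection and prepended to the text embeddings. All module names, dimensions, and shapes here are illustrative assumptions, not actual CogVLM code.

```python
import torch
import torch.nn as nn

class ShallowAlignment(nn.Module):
    """Minimal sketch of the 'shallow alignment' baseline: image features are
    mapped into the LLM embedding space by one learned projection.
    Names and sizes are illustrative assumptions."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # A single linear layer bridging the vision encoder and the LLM.
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, image_feats: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # image_feats: (batch, num_patches, vision_dim) from a frozen vision encoder
        # text_embeds: (batch, seq_len, llm_dim) from the LLM's token embedding table
        visual_tokens = self.proj(image_feats)  # (batch, num_patches, llm_dim)
        # Prepend projected visual tokens to the text sequence; the LLM then
        # attends over both with its ordinary, unmodified weights.
        return torch.cat([visual_tokens, text_embeds], dim=1)

# Toy usage with random tensors standing in for real encoder outputs.
bridge = ShallowAlignment()
img = torch.randn(2, 256, 1024)
txt = torch.randn(2, 32, 4096)
print(bridge(img, txt).shape)  # torch.Size([2, 288, 4096])
```

Per its title, CogVLM instead adds a trainable visual expert inside the pretrained language model, going beyond this kind of input-level projection.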

ShareGPT4V: Improving large multi-modal models with better captions

L Chen, J Li, X Dong, P Zhang, C He, J Wang… - … on Computer Vision, 2024 - Springer
Modality alignment serves as the cornerstone for large multi-modal models (LMMs).
However, the impact of different attributes (e.g., data type, quality, and scale) of training data …

Multimodal foundation models: From specialists to general-purpose assistants

C Li, Z Gan, Z Yang, J Yang, L Li… - … and Trends® in …, 2024 - nowpublishers.com
This monograph presents a comprehensive survey of multimodal foundation models that demonstrate vision and vision-language capabilities, focusing on the transition from specialist models to general-purpose assistants …

YOLO-World: Real-time open-vocabulary object detection

T Cheng, L Song, Y Ge, W Liu… - Proceedings of the …, 2024 - openaccess.thecvf.com
The You Only Look Once (YOLO) series of detectors have established themselves
as efficient and practical tools. However, their reliance on predefined and trained object …

Grounded SAM: Assembling open-world models for diverse visual tasks

T Ren, S Liu, A Zeng, J Lin, K Li, H Cao, J Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce Grounded SAM, which uses Grounding DINO as an open-set object detector in
combination with the Segment Anything Model (SAM). This integration enables the detection and …
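The entry above describes a two-stage composition: an open-set detector (Grounding DINO) proposes boxes from a free-form text prompt, and a promptable segmenter (SAM) turns those boxes into masks. Below is a minimal sketch of that composition with stubbed-out detector and segmenter functions; the function names, signatures, and dummy outputs are placeholder assumptions, not the real Grounded SAM API.

```python
import numpy as np

def detect_boxes(image: np.ndarray, text_prompt: str) -> np.ndarray:
    """Stub standing in for an open-set detector such as Grounding DINO:
    given an image and a text prompt, return candidate boxes as (N, 4) xyxy.
    Here it returns a fixed dummy box; a real system would run the detector."""
    h, w = image.shape[:2]
    return np.array([[0.25 * w, 0.25 * h, 0.75 * w, 0.75 * h]])

def segment_boxes(image: np.ndarray, boxes: np.ndarray) -> np.ndarray:
    """Stub standing in for a promptable segmenter such as SAM: given boxes,
    return one binary mask per box at the image's spatial resolution."""
    masks = np.zeros((len(boxes), *image.shape[:2]), dtype=bool)
    for i, (x0, y0, x1, y1) in enumerate(boxes.astype(int)):
        masks[i, y0:y1, x0:x1] = True
    return masks

def grounded_segmentation(image: np.ndarray, text_prompt: str) -> np.ndarray:
    """The composition the abstract describes: text-prompted detection followed
    by box-prompted segmentation, yielding open-vocabulary instance masks."""
    boxes = detect_boxes(image, text_prompt)
    return segment_boxes(image, boxes)

# Toy usage on a blank image.
image = np.zeros((480, 640, 3), dtype=np.uint8)
masks = grounded_segmentation(image, "a running dog")
print(masks.shape)  # (1, 480, 640)
```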

Vid2Seq: Large-scale pretraining of a visual language model for dense video captioning

A Yang, A Nagrani, PH Seo, A Miech… - Proceedings of the …, 2023 - openaccess.thecvf.com
In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning
model pretrained on narrated videos, which are readily available at scale. The Vid2Seq …

EVA: Exploring the limits of masked visual representation learning at scale

Y Fang, W Wang, B **e, Q Sun, L Wu… - Proceedings of the …, 2023‏ - openaccess.thecvf.com
We launch EVA, a vision-centric foundation model to explore the limits of visual
representation at scale using only publicly accessible data. EVA is a vanilla ViT pre-trained …

InternImage: Exploring large-scale vision foundation models with deformable convolutions

W Wang, J Dai, Z Chen, Z Huang, Z Li… - Proceedings of the …, 2023 - openaccess.thecvf.com
Compared to the great progress of large-scale vision transformers (ViTs) in recent years,
large-scale models based on convolutional neural networks (CNNs) are still in an early …