Foundation Models Defining a New Era in Vision: a Survey and Outlook

M Awais, M Naseer, S Khan, RM Anwer… - … on Pattern Analysis …, 2025 - ieeexplore.ieee.org
Vision systems that see and reason about the compositional nature of visual scenes are
fundamental to understanding our world. The complex relations between objects and their …

Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

Visual instruction tuning

H Liu, C Li, Q Wu, YJ Lee - Advances in neural information …, 2024 - proceedings.neurips.cc
Instruction tuning large language models (LLMs) using machine-generated instruction-
following data has been shown to improve zero-shot capabilities on new tasks, but the idea …

Improved baselines with visual instruction tuning

H Liu, C Li, Y Li, YJ Lee - … of the IEEE/CVF Conference on …, 2024 - openaccess.thecvf.com
Large multimodal models (LMM) have recently shown encouraging progress with visual
instruction tuning. In this paper we present the first systematic study to investigate the design …

Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond

J Bai, S Bai, S Yang, S Wang… - arXiv preprint …, 2023 - storage.prod.researchhub.com
In this work, we introduce the Qwen-VL series, a set of large-scale vision-language models
(LVLMs) designed to perceive and understand both texts and images. Starting from the …

Scaling up GANs for text-to-image synthesis

M Kang, JY Zhu, R Zhang, J Park… - Proceedings of the …, 2023 - openaccess.thecvf.com
The recent success of text-to-image synthesis has taken the world by storm and captured the
general public's imagination. From a technical standpoint, it also marked a drastic change in …

Video-LLaVA: Learning united visual representation by alignment before projection

B Lin, Y Ye, B Zhu, J Cui, M Ning, P Jin… - arXiv preprint arXiv …, 2023 - arxiv.org
The Large Vision-Language Model (LVLM) has enhanced the performance of various
downstream tasks in visual-language understanding. Most existing approaches encode …

Open-vocabulary panoptic segmentation with text-to-image diffusion models

J Xu, S Liu, A Vahdat, W Byeon… - Proceedings of the …, 2023 - openaccess.thecvf.com
We present ODISE: Open-vocabulary DIffusion-based panoptic SEgmentation, which unifies
pre-trained text-image diffusion and discriminative models to perform open-vocabulary …

SDXL: Improving latent diffusion models for high-resolution image synthesis

D Podell, Z English, K Lacey, A Blattmann… - arXiv preprint arXiv …, 2023 - arxiv.org
We present SDXL, a latent diffusion model for text-to-image synthesis. Compared to
previous versions of Stable Diffusion, SDXL leverages a three times larger UNet backbone …

Yi: Open foundation models by 01.AI

A Young, B Chen, C Li, C Huang, G Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce the Yi model family, a series of language and multimodal models that
demonstrate strong multi-dimensional capabilities. The Yi model family is based on 6B and …