Foundation Models Defining a New Era in Vision: A Survey and Outlook
Vision systems that see and reason about the compositional nature of visual scenes are
fundamental to understanding our world. The complex relations between objects and their …
Vision-language pre-training: Basics, recent advances, and future trends
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …
MiniGPT-4: Enhancing vision-language understanding with advanced large language models
The recent GPT-4 has demonstrated extraordinary multi-modal abilities, such as directly
generating websites from handwritten text and identifying humorous elements within …
LAION-5B: An open large-scale dataset for training next generation image-text models
C Schuhmann, R Beaumont, R Vencu… - Advances in …, 2022
Groundbreaking language-vision architectures like CLIP and DALL-E proved the utility of
training on large amounts of noisy image-text data, without relying on expensive accurate …
Scaling autoregressive models for content-rich text-to-image generation
We present the Pathways [1] Autoregressive Text-to-Image (Parti) model, which
generates high-fidelity photorealistic images and supports content-rich synthesis involving …
FLAVA: A foundational language and vision alignment model
State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic
pretraining for obtaining good performance on a variety of downstream tasks. Generally …
Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts
The availability of large-scale image captioning and visual question answering datasets has
contributed significantly to recent successes in vision-and-language pre-training. However …
Convolutions die hard: Open-vocabulary segmentation with single frozen convolutional CLIP
Open-vocabulary segmentation is a challenging task requiring segmenting and recognizing
objects from an open set of categories in diverse environments. One way to address this …
Winoground: Probing vision and language models for visio-linguistic compositionality
We present a novel task and dataset for evaluating the ability of vision and language models
to conduct visio-linguistic compositional reasoning, which we call Winoground. Given two …
Vision-language models for vision tasks: A survey
Most visual recognition studies rely heavily on crowd-labelled data in deep neural networks
(DNNs) training, and they usually train a DNN for each single visual recognition task …