Transformers in vision: A survey
Astounding results from Transformer models on natural language tasks have intrigued the
vision community to study their application to computer vision problems. Among their salient …
Vision-language pre-training: Basics, recent advances, and future trends
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …
Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond
In this work, we introduce the Qwen-VL series, a set of large-scale vision-language models
(LVLMs) designed to perceive and understand both texts and images. Starting from the …
Image as a foreign language: BEiT pretraining for vision and vision-language tasks
A big convergence of language, vision, and multimodal pretraining is emerging. In this work,
we introduce a general-purpose multimodal foundation model BEiT-3, which achieves …
CoCa: Contrastive captioners are image-text foundation models
Exploring large-scale pretrained foundation models is of significant interest in computer
vision because these models can be quickly transferred to many downstream tasks. This …
Grounded language-image pre-training
This paper presents a grounded language-image pre-training (GLIP) model for learning
object-level, language-aware, and semantic-rich visual representations. GLIP unifies object …
Evaluating object hallucination in large vision-language models
Inspired by the superior language abilities of large language models (LLM), large vision-
language models (LVLM) have been recently explored by integrating powerful LLMs for …
FLAVA: A foundational language and vision alignment model
State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic
pretraining for obtaining good performance on a variety of downstream tasks. Generally …
Multimodal learning with transformers: A survey
Transformer is a promising neural network learner, and has achieved great success in
various machine learning tasks. Thanks to the recent prevalence of multimodal applications …
PaLI: A jointly-scaled multilingual language-image model
Effective scaling and a flexible task interface enable large language models to excel at many
tasks. We present PaLI (Pathways Language and Image model), a model that extends this …