Foundation models for generalist medical artificial intelligence
The exceptionally rapid development of highly flexible, reusable artificial intelligence (AI)
models is likely to usher in newfound capabilities in medicine. We propose a new paradigm …
Vision-language pre-training: Basics, recent advances, and future trends
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …
When and why vision-language models behave like bags-of-words, and what to do about it?
Despite the success of large vision and language models (VLMs) in many downstream
applications, it is unclear how well they encode compositional information. Here, we create …
SugarCrepe: Fixing hackable benchmarks for vision-language compositionality
In the last year alone, a surge of new benchmarks to measure compositional
understanding of vision-language models have permeated the machine learning ecosystem …
InstructDiffusion: A generalist modeling interface for vision tasks
We present InstructDiffusion, a unified and generic framework for aligning computer vision
tasks with human instructions. Unlike existing approaches that integrate prior knowledge …
Test-time prompt tuning for zero-shot generalization in vision-language models
Pre-trained vision-language models (e.g., CLIP) have shown promising zero-shot
generalization in many downstream tasks with properly designed text prompts. Instead of …
VAST: A vision-audio-subtitle-text omni-modality foundation model and dataset
Vision and text have been fully explored in contemporary video-text foundational models,
while other modalities such as audio and subtitles in videos have not received sufficient …
InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks
The exponential growth of large language models (LLMs) has opened up numerous
possibilities for multi-modal AGI systems. However, the progress in vision and vision …
mPLUG-2: A modularized multi-modal foundation model across text, image and video
Recent years have witnessed a big convergence of language, vision, and multi-modal
pretraining. In this work, we present mPLUG-2, a new unified paradigm with modularized …
Unmasked teacher: Towards training-efficient video foundation models
Video Foundation Models (VFMs) have received limited exploration due to high
computational costs and data scarcity. Previous VFMs rely on Image Foundation Models …