Foundation models for generalist medical artificial intelligence
The exceptionally rapid development of highly flexible, reusable artificial intelligence (AI)
models is likely to usher in newfound capabilities in medicine. We propose a new paradigm …
Vision-language pre-training: Basics, recent advances, and future trends
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …
Language is not all you need: Aligning perception with language models
A big convergence of language, multimodal perception, action, and world modeling is a key
step toward artificial general intelligence. In this work, we introduce KOSMOS-1, a …
OBELICS: An open web-scale filtered dataset of interleaved image-text documents
Large multimodal models trained on natural documents, which interleave images and text,
outperform models trained on image-text pairs on various multimodal benchmarks …
Automated program repair in the era of large pre-trained language models
Automated Program Repair (APR) aims to help developers automatically patch software
bugs. However, current state-of-the-art traditional and learning-based APR techniques face …
Multimodal C4: An open, billion-scale corpus of images interleaved with text
In-context vision and language models like Flamingo support arbitrarily interleaved
sequences of images and text as input. This format not only enables few-shot learning via …
Grounding language models to images for multimodal inputs and outputs
We propose an efficient method to ground pretrained text-only language models to the
visual domain, enabling them to process arbitrarily interleaved image-and-text data, and …
Flamingo: a visual language model for few-shot learning
Building models that can be rapidly adapted to novel tasks using only a handful of annotated
examples is an open challenge for multimodal machine learning research. We introduce …