Vision-language pre-training: Basics, recent advances, and future trends
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …
Large-scale multi-modal pre-trained models: A comprehensive survey
With the urgent demand for generalized deep models, many large pre-trained models have been
proposed, such as bidirectional encoder representations from transformers (BERT) and the vision transformer (ViT) …
Multimodal foundation models: From specialists to general-purpose assistants
Prompting large language models with answer heuristics for knowledge-based visual question answering
Knowledge-based visual question answering (VQA) requires external knowledge
beyond the image to answer the question. Early studies retrieve required knowledge from …
InternVideo: General video foundation models via generative and discriminative learning
Foundation models have recently shown excellent performance on a variety of
downstream tasks in computer vision. However, most existing vision foundation models …
InternVid: A large-scale video-text dataset for multimodal understanding and generation
This paper introduces InternVid, a large-scale video-centric multimodal dataset that enables
learning powerful and transferable video-text representations for multimodal understanding …
LAION-5B: An open large-scale dataset for training next generation image-text models
C Schuhmann, R Beaumont, R Vencu… - Advances in neural …, 2022 - proceedings.neurips.cc
Groundbreaking language-vision architectures like CLIP and DALL-E proved the utility of
training on large amounts of noisy image-text data, without relying on expensive accurate …
GPT-4V(ision) is a generalist web agent, if grounded
Recent developments in large multimodal models (LMMs), especially GPT-4V(ision) and
Gemini, have been quickly expanding the capability boundaries of multimodal models …
GIT: A generative image-to-text transformer for vision and language
In this paper, we design and train a Generative Image-to-text Transformer, GIT, to unify
vision-language tasks such as image/video captioning and question answering. While …