Transformers in vision: A survey
Astounding results from Transformer models on natural language tasks have intrigued the
vision community to study their application to computer vision problems. Among their salient …
Vision-language pre-training: Basics, recent advances, and future trends
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …
Visual instruction tuning
Instruction tuning large language models (LLMs) using machine-generated instruction-
following data has been shown to improve zero-shot capabilities on new tasks, but the idea …
Multimodal learning with transformers: A survey
Transformer is a promising neural network learner, and has achieved great success in
various machine learning tasks. Thanks to the recent prevalence of multimodal applications …
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks
Large-scale pre-training methods of learning cross-modal representations on image-text
pairs are becoming popular for vision-language tasks. While existing methods simply …
Diffusion-based generation, optimization, and planning in 3D scenes
We introduce SceneDiffuser, a conditional generative model for 3D scene understanding.
SceneDiffuser provides a unified model for solving scene-conditioned generation …
How much can CLIP benefit vision-and-language tasks?
Most existing Vision-and-Language (V&L) models rely on pre-trained visual encoders, using
a relatively small set of manually-annotated data (as compared to web-crawled data), to …
Large-scale multi-modal pre-trained models: A comprehensive survey
With the urgent demand for generalized deep models, many pre-trained big models are
proposed, such as bidirectional encoder representations (BERT), vision transformer (ViT) …
Large-scale adversarial training for vision-and-language representation learning
We present VILLA, the first known effort on large-scale adversarial training for vision-and-
language (V+L) representation learning. VILLA consists of two training stages: (i) task …
History aware multimodal transformer for vision-and-language navigation
Vision-and-language navigation (VLN) aims to build autonomous visual agents that follow
instructions and navigate in real scenes. To remember previously visited locations and …