Vision-language pre-training: Basics, recent advances, and future trends
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …
From show to tell: A survey on deep learning-based image captioning
Connecting Vision and Language plays an essential role in Generative Intelligence. For this
reason, large research efforts have been devoted to image captioning, i.e., describing images …
LLaMA-Adapter: Efficient fine-tuning of language models with zero-init attention
We present LLaMA-Adapter, a lightweight adaption method to efficiently fine-tune LLaMA
into an instruction-following model. Using 52K self-instruct demonstrations, LLaMA-Adapter …
Prompting large language models with answer heuristics for knowledge-based visual question answering
Abstract Knowledge-based visual question answering (VQA) requires external knowledge
beyond the image to answer the question. Early studies retrieve required knowledge from …
FLAVA: A foundational language and vision alignment model
State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic
pretraining for obtaining good performance on a variety of downstream tasks. Generally …
MERLOT: Multimodal neural script knowledge models
As humans, we understand events in the visual world contextually, performing multimodal
reasoning across time to make inferences about the past, present, and future. We introduce …
How much can CLIP benefit vision-and-language tasks?
Most existing Vision-and-Language (V&L) models rely on pre-trained visual encoders, using
a relatively small set of manually-annotated data (as compared to web-crawled data), to …
Less is more: ClipBERT for video-and-language learning via sparse sampling
The canonical approach to video-and-language learning (e.g., video question answering)
dictates a neural model to learn from offline-extracted dense video features from vision …
VinVL: Revisiting visual representations in vision-language models
This paper presents a detailed study of improving vision features and develops an improved
object detection model for vision language (VL) tasks. Compared to the most widely used …
Multi-grained vision language pre-training: Aligning texts with visual concepts
Most existing methods in vision language pre-training rely on object-centric features
extracted through object detection and make fine-grained alignments between the extracted …