Vision-language pre-training: Basics, recent advances, and future trends
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …
intelligence that have been developed in the last few years. We group these approaches …
Large-scale multi-modal pre-trained models: A comprehensive survey
With the urgent demand for generalized deep models, many pre-trained big models are
proposed, such as bidirectional encoder representations (BERT), vision transformer (ViT) …
proposed, such as bidirectional encoder representations (BERT), vision transformer (ViT) …
Vid2seq: Large-scale pretraining of a visual language model for dense video captioning
In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning
model pretrained on narrated videos which are readily-available at scale. The Vid2Seq …
model pretrained on narrated videos which are readily-available at scale. The Vid2Seq …
Multimodal learning with transformers: A survey
Transformer is a promising neural network learner, and has achieved great success in
various machine learning tasks. Thanks to the recent prevalence of multimodal applications …
various machine learning tasks. Thanks to the recent prevalence of multimodal applications …
Flamingo: a visual language model for few-shot learning
Building models that can be rapidly adapted to novel tasks using only a handful of annotated
examples is an open challenge for multimodal machine learning research. We introduce …
examples is an open challenge for multimodal machine learning research. We introduce …
Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis
In the quest for artificial general intelligence, Multi-modal Large Language Models (MLLMs)
have emerged as a focal point in recent advancements. However, the predominant focus …
have emerged as a focal point in recent advancements. However, the predominant focus …
Self-chained image-language model for video localization and question answering
Recent studies have shown promising results on utilizing large pre-trained image-language
models for video question answering. While these image-language models can efficiently …
models for video question answering. While these image-language models can efficiently …
Zero-shot video question answering via frozen bidirectional language models
Video question answering (VideoQA) is a complex task that requires diverse multi-modal
data for training. Manual annotation of question and answers for videos, however, is tedious …
data for training. Manual annotation of question and answers for videos, however, is tedious …
Omnivl: One foundation model for image-language and video-language tasks
This paper presents OmniVL, a new foundation model to support both image-language and
video-language tasks using one universal architecture. It adopts a unified transformer-based …
video-language tasks using one universal architecture. It adopts a unified transformer-based …
Vl-adapter: Parameter-efficient transfer learning for vision-and-language tasks
Recently, fine-tuning language models pre-trained on large text corpora have provided huge
improvements on vision-and-language (V&L) tasks as well as on pure language tasks …
improvements on vision-and-language (V&L) tasks as well as on pure language tasks …