Vision-language pre-training: Basics, recent advances, and future trends
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …
VideoMamba: State space model for efficient video understanding
Addressing the dual challenges of local redundancy and global dependencies in video
understanding, this work innovatively adapts the Mamba to the video domain. The proposed …
Vid2Seq: Large-scale pretraining of a visual language model for dense video captioning
In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning
model pretrained on narrated videos, which are readily available at scale. The Vid2Seq …
mPLUG-2: A modularized multi-modal foundation model across text, image and video
Recent years have witnessed a big convergence of language, vision, and multi-modal
pretraining. In this work, we present mPLUG-2, a new unified paradigm with modularized …
Panda-70M: Captioning 70M videos with multiple cross-modality teachers
The quality of the data and annotation upper-bounds the quality of a downstream model.
While there exist large text corpora and image-text pairs, high-quality video-text data is much …
Learning open-vocabulary semantic segmentation models from natural language supervision
In this paper, we consider the problem of open-vocabulary semantic segmentation (OVS),
which aims to segment objects of arbitrary classes instead of pre-defined, closed-set …
Video-text as game players: Hierarchical Banzhaf interaction for cross-modal representation learning
Contrastive learning-based video-language representation learning approaches, e.g., CLIP,
have achieved outstanding performance by pursuing semantic interaction upon pre …
Cap4Video: What can auxiliary captions do for text-video retrieval?
Most existing text-video retrieval methods focus on cross-modal matching between the
visual content of videos and textual query sentences. However, in real-world scenarios …
VALOR: Vision-audio-language omni-perception pretraining model and dataset
In this paper, we propose a Vision-Audio-Language Omni-peRception pretraining model
(VALOR) for multi-modal understanding and generation. Different from widely-studied vision …
HiTeA: Hierarchical temporal-aware video-language pre-training
Video-language pre-training has advanced the performance of various downstream video-
language tasks. However, most previous methods directly inherit or adapt typical image …