Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

Vid2Seq: Large-scale pretraining of a visual language model for dense video captioning

A Yang, A Nagrani, PH Seo, A Miech… - Proceedings of the …, 2023 - openaccess.thecvf.com
In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning
model pretrained on narrated videos, which are readily available at scale. The Vid2Seq …
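
Vid2Seq's key idea is to quantize timestamps into special time tokens added to the text vocabulary, so event boundaries and captions are predicted as one token sequence. A minimal sketch of that quantization follows; the bin count, token names, and sequence layout are illustrative assumptions, not the paper's exact configuration.

```python
N_TIME_BINS = 100  # assumed number of time tokens added to the vocabulary

def time_token(t_sec: float, video_len_sec: float) -> str:
    """Map an absolute timestamp to a discrete time token."""
    bin_idx = min(int(t_sec / video_len_sec * N_TIME_BINS), N_TIME_BINS - 1)
    return f"<time_{bin_idx}>"

def build_target_sequence(events, video_len_sec):
    """Interleave start/end time tokens with caption text, one event after another."""
    parts = []
    for start, end, caption in sorted(events):
        parts += [time_token(start, video_len_sec),
                  time_token(end, video_len_sec),
                  caption]
    return " ".join(parts)

# Example: two annotated events in a 60-second video.
events = [(2.0, 10.5, "a person opens the fridge"),
          (12.0, 30.0, "they pour milk into a glass")]
print(build_target_sequence(events, 60.0))
# -> "<time_3> <time_17> a person opens the fridge <time_20> <time_50> they pour milk into a glass"
```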

Multimodal learning with transformers: A survey

P Xu, X Zhu, DA Clifton - IEEE Transactions on Pattern Analysis …, 2023 - ieeexplore.ieee.org
Transformer is a promising neural network learner and has achieved great success in
various machine learning tasks. Thanks to the recent prevalence of multimodal applications …
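
The core mechanism such surveys organize their taxonomies around is cross-modal attention, where one modality's tokens attend over another's. A minimal sketch, assuming arbitrary dimensions and a single PyTorch attention layer rather than any specific surveyed model:

```python
import torch
import torch.nn as nn

d_model, n_heads = 256, 4
cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

text_tokens = torch.randn(2, 16, d_model)   # (batch, text_len, dim)
video_tokens = torch.randn(2, 64, d_model)  # (batch, num_patches, dim)

# Text attends to video: queries come from one modality, keys/values from the other.
fused, attn_weights = cross_attn(query=text_tokens,
                                 key=video_tokens,
                                 value=video_tokens)
print(fused.shape)         # torch.Size([2, 16, 256])
print(attn_weights.shape)  # torch.Size([2, 16, 64])
```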

X-CLIP: End-to-end multi-grained contrastive learning for video-text retrieval

Y Ma, G Xu, X Sun, M Yan, J Zhang, R Ji - Proceedings of the 30th ACM …, 2022 - dl.acm.org
Video-text retrieval has been a crucial and fundamental task in multi-modal research. The
development of video-text retrieval has been considerably promoted by large-scale multi …
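
X-CLIP's multi-grained idea is to score not only video against sentence but also cross-grained pairs (video-word, frame-sentence) and fine-grained frame-word pairs, then fuse the similarities. A hedged sketch, assuming simple mean-based fusion in place of the paper's learned attention over similarity matrices:

```python
import torch
import torch.nn.functional as F

B, T, W, D = 4, 8, 12, 256                         # batch, frames, words, dim
frames = F.normalize(torch.randn(B, T, D), dim=-1)
words  = F.normalize(torch.randn(B, W, D), dim=-1)
video  = F.normalize(frames.mean(dim=1), dim=-1)   # coarse video embedding
sent   = F.normalize(words.mean(dim=1), dim=-1)    # coarse sentence embedding

s_vs = video @ sent.t()                                             # video-sentence
s_vw = torch.einsum('id,jwd->ijw', video, words).mean(-1)           # video-word
s_fs = torch.einsum('itd,jd->ijt', frames, sent).mean(-1)           # frame-sentence
s_fw = torch.einsum('itd,jwd->ijtw', frames, words).mean((-1, -2))  # frame-word

score = (s_vs + s_vw + s_fs + s_fw) / 4        # fused multi-grained similarity
labels = torch.arange(B)
loss = F.cross_entropy(score / 0.07, labels)   # InfoNCE over the fused scores
print(score.shape, loss.item())
```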

All in one: Exploring unified video-language pre-training

J Wang, Y Ge, R Yan, Y Ge, KQ Lin… - Proceedings of the …, 2023 - openaccess.thecvf.com
Mainstream Video-Language Pre-training models consist of three parts: a video
encoder, a text encoder, and a video-text fusion Transformer. They pursue better …
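
To make the contrast concrete: instead of the three-part design named above, a unified ("all in one") model feeds raw video patch tokens and text tokens through a single shared backbone. A minimal sketch, with layer sizes and the plain concatenation scheme as illustrative assumptions:

```python
import torch
import torch.nn as nn

d_model = 256
shared_backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
    num_layers=2,
)

video_tokens = torch.randn(2, 64, d_model)  # patch embeddings from raw frames
text_tokens = torch.randn(2, 16, d_model)   # word embeddings

# Unified route: one transformer sees both modalities at once.
joint = torch.cat([video_tokens, text_tokens], dim=1)  # (2, 80, d_model)
fused = shared_backbone(joint)
print(fused.shape)  # torch.Size([2, 80, 256])
```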

EgoVLPv2: Egocentric video-language pre-training with fusion in the backbone

S Pramanick, Y Song, S Nag, KQ Lin… - Proceedings of the …, 2023 - openaccess.thecvf.com
Video-language pre-training (VLP) has become increasingly important due to its ability to
generalize to various vision and language tasks. However, existing egocentric VLP …
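
The "fusion in the backbone" idea is to insert cross-attention sub-layers inside the encoder blocks themselves and make them switchable, so one backbone serves both dual-encoder retrieval and fused inference. A sketch under that assumption; the exact block layout here is illustrative, not EgoVLPv2's:

```python
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, other=None):
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h)[0]
        if other is not None:  # fusion switched on: attend to the other modality
            x = x + self.cross_attn(self.norm2(x), other, other)[0]
        return x

block = FusionBlock()
video = torch.randn(2, 64, 256)
text = torch.randn(2, 16, 256)
print(block(video).shape)              # dual-encoder mode, no fusion
print(block(video, other=text).shape)  # fused mode, same weights
```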

Verbs in action: Improving verb understanding in video-language models

L Momeni, M Caron, A Nagrani… - Proceedings of the …, 2023 - openaccess.thecvf.com
Understanding verbs is crucial to modelling how people and objects interact with each other
and the environment through space and time. Recently, state-of-the-art video-language …
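
A common recipe for improving verb understanding, and the spirit of this paper's verb-focused contrastive objective, is to build hard negatives that differ from the true caption only in the verb. A toy sketch; the hand-written verb list and naive string replacement are stand-ins for the paper's language-model-generated negatives:

```python
import random

VERB_SWAPS = {"opens": ["closes", "breaks"],
              "pours": ["drinks", "spills"],
              "enters": ["leaves", "passes"]}

def verb_hard_negative(caption: str) -> str:
    """Replace the first known verb with a plausible alternative action."""
    words = caption.split()
    for i, w in enumerate(words):
        if w in VERB_SWAPS:
            words[i] = random.choice(VERB_SWAPS[w])
            return " ".join(words)
    return caption  # no known verb found; caller should skip this sample

random.seed(0)
print(verb_hard_negative("a person opens the fridge and pours milk"))
# e.g. "a person closes the fridge and pours milk"
```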

Video-text as game players: Hierarchical Banzhaf interaction for cross-modal representation learning

P Jin, J Huang, P Xiong, S Tian, C Liu… - Proceedings of the …, 2023 - openaccess.thecvf.com
Contrastive learning-based video-language representation learning approaches, e.g., CLIP,
have achieved outstanding performance; these methods pursue semantic interaction upon pre …
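
The Banzhaf value from cooperative game theory, which the paper's hierarchical interaction builds on, scores each "player" (e.g., a frame or a word) by its average marginal contribution over all coalitions of the other players. A worked toy example, with an assumed value function:

```python
from itertools import combinations

def banzhaf_value(players, v, i):
    """Average marginal contribution of player i over all coalitions without it."""
    others = [p for p in players if p != i]
    total = 0.0
    for r in range(len(others) + 1):
        for coalition in combinations(others, r):
            total += v(set(coalition) | {i}) - v(set(coalition))
    return total / 2 ** len(others)

# Toy value function: a coalition's "worth" is how many important tokens it holds.
important = {"frame_2", "word_dog"}
v = lambda S: len(S & important)

players = ["frame_1", "frame_2", "word_dog", "word_the"]
for p in players:
    print(p, banzhaf_value(players, v, p))
# frame_2 and word_dog get value 1.0; the uninformative tokens get 0.0.
```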

TS2-Net: Token shift and selection transformer for text-video retrieval

Y Liu, P Xiong, L Xu, S Cao, Q Jin - European conference on computer …, 2022 - Springer
Text-video retrieval is a task of great practical value and has received increasing attention;
learning spatial-temporal video representation is one of its research hotspots …
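
The token shift idea moves whole token features back and forth across adjacent frames, injecting temporal cues at zero parameter cost. A minimal sketch; which tokens move is fixed here, whereas TS2-Net also learns a selection module on top:

```python
import torch

def temporal_token_shift(x: torch.Tensor, n_shift: int = 4) -> torch.Tensor:
    """x: (batch, frames, tokens, dim). Move a few whole tokens across adjacent frames."""
    out = x.clone()
    out[:, 1:, :n_shift] = x[:, :-1, :n_shift]                         # shifted forward in time
    out[:, :-1, n_shift:2 * n_shift] = x[:, 1:, n_shift:2 * n_shift]  # shifted backward
    return out

x = torch.randn(2, 8, 50, 256)  # (batch, frames, patch tokens, dim)
print(temporal_token_shift(x).shape)  # torch.Size([2, 8, 50, 256])
```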

VIOLET: End-to-end video-language transformers with masked visual-token modeling

TJ Fu, L Li, Z Gan, K Lin, WY Wang, L Wang… - arXiv preprint arXiv …, 2021 - arxiv.org
A great challenge in video-language (VidL) modeling lies in the disconnection between
fixed video representations extracted from image/video understanding models and …
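
Masked visual-token modeling trains the model to predict discrete ids for masked video patches, with the ids supplied by a frozen quantizer (a discrete VAE in VIOLET). A minimal sketch where a random id table and a tiny head stand in for the real components:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

B, N, D, V = 2, 64, 256, 8192              # batch, patches, dim, visual-token vocab
patch_emb = torch.randn(B, N, D)
target_ids = torch.randint(0, V, (B, N))   # ids a frozen dVAE would assign

mask = torch.rand(B, N) < 0.15             # mask ~15% of patches
mask_token = nn.Parameter(torch.zeros(D))
inp = torch.where(mask.unsqueeze(-1), mask_token.expand(B, N, D), patch_emb)

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(D, nhead=4, batch_first=True), num_layers=2)
head = nn.Linear(D, V)                     # predicts discrete visual-token ids

logits = head(encoder(inp))                                # (B, N, V)
loss = F.cross_entropy(logits[mask], target_ids[mask])     # masked positions only
print(loss.item())
```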