Vlp: A survey on vision-language pre-training

FL Chen, DZ Zhang, ML Han, XY Chen, J Shi… - Machine Intelligence …, 2023 - Springer
In the past few years, the emergence of pre-training models has brought uni-modal fields
such as computer vision (CV) and natural language processing (NLP) to a new era …

Deep learning-based action detection in untrimmed videos: A survey

E Vahdani, Y Tian - IEEE Transactions on Pattern Analysis and …, 2022 - ieeexplore.ieee.org
Understanding human behavior and activity facilitates advancement of numerous real-world
applications, and is critical for video analysis. Despite the progress of action recognition …

Multimodal learning with transformers: A survey

P Xu, X Zhu, DA Clifton - IEEE Transactions on Pattern Analysis …, 2023 - ieeexplore.ieee.org
Transformer is a promising neural network learner, and has achieved great success in
various machine learning tasks. Thanks to the recent prevalence of multimodal applications …

S4nd: Modeling images and videos as multidimensional signals with state spaces

E Nguyen, K Goel, A Gu, G Downs… - Advances in neural …, 2022 - proceedings.neurips.cc
Visual data such as images and videos are typically modeled as discretizations of inherently
continuous, multidimensional signals. Existing continuous-signal models attempt to exploit …

Mvitv2: Improved multiscale vision transformers for classification and detection

Y Li, CY Wu, H Fan, K Mangalam… - Proceedings of the …, 2022 - openaccess.thecvf.com
In this paper, we study Multiscale Vision Transformers (MViTv2) as a unified architecture for
image and video classification, as well as object detection. We present an improved version …

Prompting visual-language models for efficient video understanding

C Ju, T Han, K Zheng, Y Zhang, W **e - European Conference on …, 2022 - Springer
Image-based visual-language (I-VL) pre-training has shown great success for learning joint
visual-textual representations from large-scale web data, revealing remarkable ability for …

Multiscale vision transformers

H Fan, B **ong, K Mangalam, Y Li… - Proceedings of the …, 2021 - openaccess.thecvf.com
Abstract We present Multiscale Vision Transformers (MViT) for video and image recognition,
by connecting the seminal idea of multiscale feature hierarchies with transformer models …

Memvit: Memory-augmented multiscale vision transformer for efficient long-term video recognition

CY Wu, Y Li, K Mangalam, H Fan… - Proceedings of the …, 2022 - openaccess.thecvf.com
While today's video recognition systems parse snapshots or short clips accurately, they
cannot connect the dots and reason across a longer range of time yet. Most existing video …

End-to-end generative pretraining for multimodal video captioning

PH Seo, A Nagrani, A Arnab… - Proceedings of the …, 2022 - openaccess.thecvf.com
Recent video and language pretraining frameworks lack the ability to generate sentences.
We present Multimodal Video Generative Pretraining (MV-GPT), a new pretraining …

A large-scale study on unsupervised spatiotemporal representation learning

C Feichtenhofer, H Fan, B **ong… - Proceedings of the …, 2021 - openaccess.thecvf.com
We present a large-scale study on unsupervised spatiotemporal representation learning
from videos. With a unified perspective on four recent image-based frameworks, we study a …