Vision-language pre-training: Basics, recent advances, and future trends
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …
intelligence that have been developed in the last few years. We group these approaches …
A comprehensive survey of few-shot learning: Evolution, applications, challenges, and opportunities
Few-shot learning (FSL) has emerged as an effective learning method and shows great
potential. Despite the recent creative works in tackling FSL tasks, learning valid information …
potential. Despite the recent creative works in tackling FSL tasks, learning valid information …
Video-chatgpt: Towards detailed video understanding via large vision and language models
Conversation agents fueled by Large Language Models (LLMs) are providing a new way to
interact with visual data. While there have been initial attempts for image-based …
interact with visual data. While there have been initial attempts for image-based …
Internvideo: General video foundation models via generative and discriminative learning
The foundation models have recently shown excellent performance on a variety of
downstream tasks in computer vision. However, most existing vision foundation models …
downstream tasks in computer vision. However, most existing vision foundation models …
Expanding language-image pretrained models for general video recognition
Contrastive language-image pretraining has shown great success in learning visual-textual
joint representation from web-scale data, demonstrating remarkable “zero-shot” …
joint representation from web-scale data, demonstrating remarkable “zero-shot” …
[HTML][HTML] Review of large vision models and visual prompt engineering
Visual prompt engineering is a fundamental methodology in the field of visual and image
artificial general intelligence. As the development of large vision models progresses, the …
artificial general intelligence. As the development of large vision models progresses, the …
Denseclip: Language-guided dense prediction with context-aware prompting
Recent progress has shown that large-scale pre-training using contrastive image-text pairs
can be a promising alternative for high-quality visual representation learning from natural …
can be a promising alternative for high-quality visual representation learning from natural …
Large-scale multi-modal pre-trained models: A comprehensive survey
With the urgent demand for generalized deep models, many pre-trained big models are
proposed, such as bidirectional encoder representations (BERT), vision transformer (ViT) …
proposed, such as bidirectional encoder representations (BERT), vision transformer (ViT) …
St-adapter: Parameter-efficient image-to-video transfer learning
Capitalizing on large pre-trained models for various downstream tasks of interest have
recently emerged with promising performance. Due to the ever-growing model size, the …
recently emerged with promising performance. Due to the ever-growing model size, the …
Pointclip: Point cloud understanding by clip
Recently, zero-shot and few-shot learning via Contrastive Vision-Language Pre-training
(CLIP) have shown inspirational performance on 2D visual recognition, which learns to …
(CLIP) have shown inspirational performance on 2D visual recognition, which learns to …