[HTML][HTML] Review of large vision models and visual prompt engineering

J Wang, Z Liu, L Zhao, Z Wu, C Ma, S Yu, H Dai… - Meta-Radiology, 2023 - Elsevier
Visual prompt engineering is a fundamental methodology in the field of visual and image
artificial general intelligence. As the development of large vision models progresses, the …

Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

Vision-language models for vision tasks: A survey

J Zhang, J Huang, S **, S Lu - IEEE Transactions on Pattern …, 2024 - ieeexplore.ieee.org
Most visual recognition studies rely heavily on crowd-labelled data in deep neural networks
(DNNs) training, and they usually train a DNN for each single visual recognition task …

Multimodal foundation models: From specialists to general-purpose assistants

C Li, Z Gan, Z Yang, J Yang, L Li… - … and Trends® in …, 2024 - nowpublishers.com
Neural compression is the application of neural networks and other machine learning
methods to data compression. Recent advances in statistical machine learning have opened …

MM1: methods, analysis and insights from multimodal LLM pre-training

B McKinzie, Z Gan, JP Fauconnier, S Dodge… - … on Computer Vision, 2024 - Springer
In this work, we discuss building performant Multimodal Large Language Models (MLLMs).
In particular, we study the importance of various architecture components and data choices …

Segment and Recognize Anything at Any Granularity

F Li, H Zhang, P Sun, X Zou, S Liu, C Li, J Yang… - … on Computer Vision, 2024 - Springer
In this work, we introduce Semantic-SAM, an augmented image segmentation foundation for
segmenting and recognizing anything at desired granularities. Compared to the …

Openscene: 3d scene understanding with open vocabularies

S Peng, K Genova, C Jiang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Traditional 3D scene understanding approaches rely on labeled 3D datasets to train a
model for a single task with supervision. We propose OpenScene, an alternative approach …

Maple: Multi-modal prompt learning

MU Khattak, H Rasheed, M Maaz… - Proceedings of the …, 2023 - openaccess.thecvf.com
Pre-trained vision-language (VL) models such as CLIP have shown excellent generalization
ability to downstream tasks. However, they are sensitive to the choice of input text prompts …

Self-regulating prompts: Foundational model adaptation without forgetting

MU Khattak, ST Wasim, M Naseer… - Proceedings of the …, 2023 - openaccess.thecvf.com
Prompt learning has emerged as an efficient alternative for fine-tuning foundational models,
such as CLIP, for various downstream tasks. Conventionally trained using the task-specific …

Visual-language prompt tuning with knowledge-guided context optimization

H Yao, R Zhang, C Xu - … of the IEEE/CVF conference on …, 2023 - openaccess.thecvf.com
Prompt tuning is an effective way to adapt the pretrained visual-language model (VLM) to
the downstream task using task-related textual tokens. Representative CoOp-based works …