Review of large vision models and visual prompt engineering

J Wang, Z Liu, L Zhao, Z Wu, C Ma, S Yu, H Dai… - Meta-Radiology, 2023 - Elsevier
Visual prompt engineering is a fundamental methodology in the field of visual and image
artificial general intelligence. As the development of large vision models progresses, the …

Long-CLIP: Unlocking the long-text capability of CLIP

B Zhang, P Zhang, X Dong, Y Zang, J Wang - European Conference on …, 2024 - Springer
Contrastive Language-Image Pre-training (CLIP) has been the cornerstone for zero-
shot classification, text-image retrieval, and text-image generation by aligning image and …
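
As an illustration of the image-text alignment this entry builds on, here is a minimal zero-shot classification sketch using the Hugging Face transformers CLIP wrappers; the checkpoint name, image path, and label prompts are placeholder assumptions, not details from the paper.

```python
# Minimal sketch of CLIP zero-shot classification via image-text
# alignment. Checkpoint, image path, and labels are illustrative
# placeholders, not taken from the cited paper.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg").convert("RGB")  # any RGB image
labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds scaled cosine similarities between the image
# and each text prompt; softmax turns them into class probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
print({label: round(p.item(), 3) for label, p in zip(labels, probs[0])})
```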

EgoVLPv2: Egocentric video-language pre-training with fusion in the backbone

S Pramanick, Y Song, S Nag, KQ Lin… - Proceedings of the …, 2023 - openaccess.thecvf.com
Video-language pre-training (VLP) has become increasingly important due to its ability to
generalize to various vision and language tasks. However, existing egocentric VLP …
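
"Fusion in the backbone" can be read as inserting cross-modal attention inside the encoder layers rather than bolting a separate fusion head on top. The PyTorch block below is a generic sketch under that reading; its dimensions and wiring are assumptions, not EgoVLPv2's actual module.

```python
# Hedged sketch of "fusion in the backbone": a cross-attention block
# that lets video tokens attend to text tokens inside an encoder
# layer. Dimensions and structure are illustrative assumptions.
import torch
import torch.nn as nn

class CrossModalFusionBlock(nn.Module):
    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, video_tokens, text_tokens):
        # Video tokens query the text tokens; the residual connection
        # keeps the original unimodal features intact.
        fused, _ = self.cross_attn(
            self.norm(video_tokens), text_tokens, text_tokens
        )
        return video_tokens + fused

video = torch.randn(2, 196, 768)  # (batch, video tokens, dim)
text = torch.randn(2, 32, 768)    # (batch, text tokens, dim)
print(CrossModalFusionBlock()(video, text).shape)  # torch.Size([2, 196, 768])
```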

FuseCap: Leveraging large language models for enriched fused image captions

N Rotstein, D Bensaid, S Brody… - Proceedings of the …, 2024 - openaccess.thecvf.com
The advent of vision-language pre-training techniques has enabled substantial progress in
the development of models for image captioning. However, these models frequently produce …

SA-Attack: Improving adversarial transferability of vision-language pre-training models via self-augmentation

B He, X Jia, S Liang, T Lou, Y Liu, X Cao - arXiv preprint, 2023 - arxiv.org

… in the Era of Large Models

B YANG, Y CHEN, Q ZOU - Geomatics and Information Science of …, 2023 - ch.whu.edu.cn
Currently, spatiotemporal information, positioning and navigation have become important
new infrastructures. Driven by general artificial intelligence, the era of intelligence led by …

APoLLo: Unified adapter and prompt learning for vision language models

S Chowdhury, S Nag, D Manocha - arXiv preprint arXiv:2312.01564, 2023 - arxiv.org
The choice of input text prompt plays a critical role in the performance of Vision-Language
Pretrained (VLP) models such as CLIP. We present APoLLo, a unified multi-modal approach …
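
To picture what unified adapter and prompt learning optimizes, the sketch below pairs learnable prompt tokens with a residual bottleneck adapter on top of frozen features; all names and sizes are illustrative assumptions rather than APoLLo's actual architecture.

```python
# Generic sketch of combining learnable prompt tokens with a
# bottleneck adapter on top of a frozen encoder. Names, sizes, and
# wiring are illustrative assumptions, not APoLLo's actual design.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Residual bottleneck adapter: down-project, nonlinearity, up-project."""
    def __init__(self, dim: int = 512, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))

dim, n_prompt = 512, 8
prompts = nn.Parameter(torch.randn(n_prompt, dim) * 0.02)  # learnable prompts
adapter = Adapter(dim)

tokens = torch.randn(2, 77, dim)  # frozen-encoder token embeddings (assumed)
with_prompts = torch.cat([prompts.expand(2, -1, -1), tokens], dim=1)
out = adapter(with_prompts)
print(out.shape)  # torch.Size([2, 85, 512])

# Only the prompt vectors and adapter weights would be trained;
# the pre-trained encoder itself stays frozen.
```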

GroundVLP: Harnessing zero-shot visual grounding from vision-language pre-training and open-vocabulary object detection

H Shen, T Zhao, M Zhu, J Yin - Proceedings of the AAAI Conference on …, 2024 - ojs.aaai.org
Visual grounding, a crucial vision-language task involving the understanding of the visual
context based on the query expression, requires the model to capture the interactions …
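
A common way to combine vision-language pre-training with open-vocabulary detection for zero-shot grounding is to score detector proposals against the query with an image-text model and return the best box. The sketch below assumes hypothetical embed_image/embed_text helpers and is not GroundVLP's actual pipeline.

```python
# Hedged sketch of detector-plus-VLP zero-shot grounding: crop each
# proposal, embed crops and the query text, return the best-matching
# box. `embed_image`/`embed_text` are assumed stand-ins for any
# image-text model (e.g. CLIP); this is not GroundVLP's pipeline.
import torch
import torch.nn.functional as F

def ground(image, boxes, query, embed_image, embed_text):
    """boxes: list of (x0, y0, x1, y1) proposals from a detector."""
    crops = [image.crop(b) for b in boxes]  # PIL region crops
    crop_emb = F.normalize(torch.stack([embed_image(c) for c in crops]), dim=-1)
    text_emb = F.normalize(embed_text(query), dim=-1)
    scores = crop_emb @ text_emb  # cosine similarity per proposal
    return boxes[scores.argmax().item()], scores
```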

Gradient-based visual explanation for transformer-based CLIP

C Zhao, K Wang, X Zeng, R Zhao… - … on Machine Learning, 2024 - proceedings.mlr.press
Significant progress has been achieved in improving and extending the downstream uses of
the Contrastive Language-Image Pre-training (CLIP) vision-language model, while less …

Learning to learn better visual prompts

F Wang, W Huang, S Yang, Q Fan, L Lan - Proceedings of the AAAI …, 2024 - ojs.aaai.org
Prompt tuning provides a low-cost way of adapting vision-language models (VLMs) to
various downstream vision tasks without updating the huge pre-trained …
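
Prompt tuning in this sense replaces hand-written prompt text with learnable context vectors trained against a frozen model. Below is a CoOp-style sketch in plain PyTorch; the tiny frozen encoder, sizes, and loss are stand-ins, not any cited paper's setup.

```python
# CoOp-style prompt tuning sketch: only the context vectors are
# trained while the (stand-in) text encoder stays frozen. Encoder,
# sizes, and loss are illustrative assumptions.
import torch
import torch.nn as nn

dim, n_ctx, n_cls = 64, 4, 3
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=1
)
for p in encoder.parameters():
    p.requires_grad_(False)  # frozen pre-trained weights (stand-in)

ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)  # learnable context
cls_emb = torch.randn(n_cls, 1, dim)                # fixed class-name tokens
opt = torch.optim.Adam([ctx], lr=1e-3)              # optimize prompts only

img_feat = torch.randn(8, dim)                      # frozen image features
labels = torch.randint(0, n_cls, (8,))

for _ in range(10):
    # Prepend the shared context to each class token, encode, pool.
    prompts = torch.cat([ctx.expand(n_cls, -1, -1), cls_emb], dim=1)
    cls_feat = encoder(prompts).mean(dim=1)         # (n_cls, dim)
    logits = img_feat @ cls_feat.t()
    loss = nn.functional.cross_entropy(logits, labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
```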