Review of large vision models and visual prompt engineering

J Wang, Z Liu, L Zhao, Z Wu, C Ma, S Yu, H Dai… - Meta-Radiology, 2023 - Elsevier
Visual prompt engineering is a fundamental methodology in the field of visual and image
artificial general intelligence. As the development of large vision models progresses, the …

Long-CLIP: Unlocking the long-text capability of CLIP

B Zhang, P Zhang, X Dong, Y Zang, J Wang - European Conference on …, 2024 - Springer
Contrastive Language-Image Pre-training (CLIP) has been the cornerstone for zero-
shot classification, text-image retrieval, and text-image generation by aligning image and …
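
The image-text alignment that the CLIP-based entries in this list build on can be illustrated as zero-shot classification: embed an image and a set of candidate captions, then rank the captions by similarity. A minimal, hypothetical sketch (assuming the Hugging Face transformers library and the public openai/clip-vit-base-patch32 checkpoint; this shows vanilla CLIP, not the Long-CLIP extension above):

from PIL import Image
import requests
import torch
from transformers import CLIPModel, CLIPProcessor

# Load the pre-trained CLIP model and its paired preprocessor.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# The example image URL is an assumption; any RGB image works.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
labels = ["a photo of a cat", "a photo of a dog"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns
# them into probabilities over the candidate labels.
probs = outputs.logits_per_image.softmax(dim=1)
print(dict(zip(labels, probs[0].tolist())))

Note that CLIP's text encoder is capped at 77 tokens per prompt, which is the long-text limitation that Long-CLIP targets.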

EgoVLPv2: Egocentric video-language pre-training with fusion in the backbone

S Pramanick, Y Song, S Nag, KQ Lin… - Proceedings of the …, 2023 - openaccess.thecvf.com
Video-language pre-training (VLP) has become increasingly important due to its ability to
generalize to various vision and language tasks. However, existing egocentric VLP …

FuseCap: Leveraging large language models for enriched fused image captions

N Rotstein, D Bensaïd, S Brody… - Proceedings of the …, 2024 - openaccess.thecvf.com
The advent of vision-language pre-training techniques has enabled substantial progress in the
development of models for image captioning. However, these models frequently produce …

Gradient-based visual explanation for transformer-based CLIP

C Zhao, K Wang, X Zeng, R Zhao… - … on Machine Learning, 2024 - proceedings.mlr.press
Significant progress has been achieved in improving the Contrastive Language-Image
Pre-training (CLIP) vision-language model and in its downstream uses, while less …

E-CLIP: Towards label-efficient event-based open-world understanding by CLIP

J Zhou, X Zheng, Y Lyu, L Wang - arXiv preprint arXiv:2308.03135, 2023 - arxiv.org
Contrastive Language-Image Pre-training (CLIP) has recently shown promising open-world
and few-shot performance on 2D image-based recognition tasks. However, the transferred …

SA-Attack: Improving adversarial transferability of vision-language pre-training models via self-augmentation

B He, X Jia, S Liang, T Lou, Y Liu, X Cao - arXiv preprint arXiv:2312.04913, 2023 - arxiv.org
Current Vision-Language Pre-training (VLP) models are vulnerable to adversarial examples.
These adversarial examples present substantial security risks to VLP models, as they can …

Genixer: Empowering Multimodal Large Language Models as a Powerful Data Generator

HH Zhao, P Zhou, MZ Shou - European Conference on Computer Vision, 2024 - Springer
Multimodal Large Language Models (MLLMs) demonstrate exceptional problem-
solving capabilities, but few studies aim to gauge their ability to generate visual …

EventBind: Learning a unified representation to bind them all for event-based open-world understanding

J Zhou, X Zheng, Y Lyu, L Wang - European Conference on Computer …, 2024 - Springer
In this paper, we propose EventBind, a novel and effective framework that unleashes the
potential of vision-language models (VLMs) for event-based recognition to compensate for …