Review of large vision models and visual prompt engineering
Visual prompt engineering is a fundamental methodology in the field of visual and image
artificial general intelligence. As the development of large vision models progresses, the …
Long-CLIP: Unlocking the long-text capability of CLIP
Contrastive Language-Image Pre-training (CLIP) has been the cornerstone for zero-
shot classification, text-image retrieval, and text-image generation by aligning image and …
EgoVLPv2: Egocentric video-language pre-training with fusion in the backbone
Video-language pre-training (VLP) has become increasingly important due to its ability to
generalize to various vision and language tasks. However, existing egocentric VLP …
FuseCap: Leveraging large language models for enriched fused image captions
The advent of vision-language pre-training techniques enabled substantial progress in the
development of models for image captioning. However, these models frequently produce …
Gradient-based visual explanation for transformer-based CLIP
Significant progress has been achieved on the improvement and downstream usages of the
Contrastive Language-Image Pre-training (CLIP) vision-language model, while less …
E-CLIP: Towards label-efficient event-based open-world understanding by CLIP
Contrastive Language-Image Pre-training (CLIP) has recently shown promising open-world
and few-shot performance on 2D image-based recognition tasks. However, the transferred …
SA-Attack: Improving adversarial transferability of vision-language pre-training models via self-augmentation
Current Visual-Language Pre-training (VLP) models are vulnerable to adversarial examples.
These adversarial examples present substantial security risks to VLP models, as they can …
GENIXER: Empowering Multimodal Large Language Model as a Powerful Data Generator
Multimodal Large Language Models (MLLMs) demonstrate exceptional problem-
solving capabilities, but few research studies aim to gauge the ability to generate visual …
EventBind: Learning a unified representation to bind them all for event-based open-world understanding
In this paper, we propose EventBind, a novel and effective framework that unleashes the
potential of vision-language models (VLMs) for event-based recognition to compensate for …