Review of large vision models and visual prompt engineering
Visual prompt engineering is a fundamental methodology in the field of visual and image
artificial general intelligence. As the development of large vision models progresses, the …
Long-CLIP: Unlocking the long-text capability of CLIP
Abstract Contrastive Language-Image Pre-training (CLIP) has been the cornerstone for zero-
shot classification, text-image retrieval, and text-image generation by aligning image and …
EgoVLPv2: Egocentric video-language pre-training with fusion in the backbone
Video-language pre-training (VLP) has become increasingly important due to its ability to
generalize to various vision and language tasks. However, existing egocentric VLP …
FuseCap: Leveraging large language models for enriched fused image captions
The advent of vision-language pre-training techniques enhanced substantial progress in the
development of models for image captioning. However, these models frequently produce …
SA-Attack: Improving adversarial transferability of vision-language pre-training models via self-augmentation
APoLLo: Unified adapter and prompt learning for vision language models
The choice of input text prompt plays a critical role in the performance of Vision-Language
Pretrained (VLP) models such as CLIP. We present APoLLo, a unified multi-modal approach …
GroundVLP: Harnessing zero-shot visual grounding from vision-language pre-training and open-vocabulary object detection
Visual grounding, a crucial vision-language task involving the understanding of the visual
context based on the query expression, necessitates the model to capture the interactions …
Gradient-based visual explanation for transformer-based CLIP
Significant progress has been achieved on the improvement and downstream usages of the
Contrastive Language-Image Pre-training (CLIP) vision-language model, while less …
Learning to learn better visual prompts
Prompt tuning provides a low-cost way of adapting vision-language models (VLMs) for
various downstream vision tasks without requiring updating the huge pre-trained …