Review of large vision models and visual prompt engineering
Visual prompt engineering is a fundamental methodology in the field of visual and image
artificial general intelligence. As the development of large vision models progresses, the …
Long-CLIP: Unlocking the long-text capability of CLIP
Contrastive Language-Image Pre-training (CLIP) has been the cornerstone for zero-
shot classification, text-image retrieval, and text-image generation by aligning image and …
EgoVLPv2: Egocentric video-language pre-training with fusion in the backbone
Video-language pre-training (VLP) has become increasingly important due to its ability to
generalize to various vision and language tasks. However, existing egocentric VLP …
FuseCap: Leveraging large language models for enriched fused image captions
The advent of vision-language pre-training techniques enabled substantial progress in the
development of models for image captioning. However, these models frequently produce …
Gradient-based visual explanation for transformer-based CLIP
Significant progress has been achieved on the improvement and downstream usages of the
Contrastive Language-Image Pre-training (CLIP) vision-language model, while less …
E-CLIP: Towards label-efficient event-based open-world understanding by CLIP
Contrastive Language-Image Pre-training (CLIP) has recently shown promising open-world
and few-shot performance on 2D image-based recognition tasks. However, the transferred …
SA-Attack: Improving adversarial transferability of vision-language pre-training models via self-augmentation
Current Visual-Language Pre-training (VLP) models are vulnerable to adversarial examples.
These adversarial examples present substantial security risks to VLP models, as they can …
GENIXER: Empowering Multimodal Large Language Model as a Powerful Data Generator
Multimodal Large Language Models (MLLMs) demonstrate exceptional problem-
solving capabilities, but few research studies aim to gauge the ability to generate visual …
EventBind: Learning a unified representation to bind them all for event-based open-world understanding
In this paper, we propose EventBind, a novel and effective framework that unleashes the
potential of vision-language models (VLMs) for event-based recognition to compensate for …