Review of large vision models and visual prompt engineering

J Wang, Z Liu, L Zhao, Z Wu, C Ma, S Yu, H Dai… - Meta-Radiology, 2023 - Elsevier
Visual prompt engineering is a fundamental methodology in the field of visual and image
artificial general intelligence. As the development of large vision models progresses, the …

CLIP in medical imaging: A comprehensive survey

Z Zhao, Y Liu, H Wu, M Wang, Y Li, S Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
Contrastive Language-Image Pre-training (CLIP), a simple yet effective pre-training
paradigm, successfully introduces text supervision to vision models. It has shown promising …

Long-CLIP: Unlocking the long-text capability of CLIP

B Zhang, P Zhang, X Dong, Y Zang, J Wang - European Conference on …, 2024 - Springer
Contrastive Language-Image Pre-training (CLIP) has been the cornerstone for zero-
shot classification, text-image retrieval, and text-image generation by aligning image and …

Multimodal learning with transformers: A survey

P Xu, X Zhu, DA Clifton - IEEE Transactions on Pattern Analysis …, 2023 - ieeexplore.ieee.org
The Transformer is a promising neural network learner and has achieved great success in
various machine learning tasks. Thanks to the recent prevalence of multimodal applications …

Blind image quality assessment via vision-language correspondence: A multitask learning perspective

W Zhang, G Zhai, Y Wei, X Yang… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
We aim at advancing blind image quality assessment (BIQA), which predicts the human
perception of image quality without any reference information. We develop a general and …

MotionCLIP: Exposing human motion generation to CLIP space

G Tevet, B Gordon, A Hertz, AH Bermano… - … on Computer Vision, 2022 - Springer
We introduce MotionCLIP, a 3D human motion auto-encoder featuring a latent embedding
that is disentangled, well behaved, and supports highly semantic textual descriptions …

LLM-grounded diffusion: Enhancing prompt understanding of text-to-image diffusion models with large language models

L Lian, B Li, A Yala, T Darrell - arXiv preprint arXiv:2305.13655, 2023 - arxiv.org
Recent advancements in text-to-image diffusion models have yielded impressive results in
generating realistic and diverse images. However, these models still struggle with complex …

RePrompt: Automatic prompt editing to refine AI-generative art towards precise expressions

Y Wang, S Shen, BY Lim - Proceedings of the 2023 CHI conference on …, 2023 - dl.acm.org
Generative AI models have shown impressive ability to produce images with text prompts,
which could benefit creativity in visual art creation and self-expression. However, it is …

Teaching CLIP to count to ten

R Paiss, A Ephrat, O Tov, S Zada… - Proceedings of the …, 2023 - openaccess.thecvf.com
Large vision-language models, such as CLIP, learn robust representations of text and
images, facilitating advances in many downstream tasks, including zero-shot classification …

CLIPDraw: Exploring text-to-drawing synthesis through language-image encoders

K Frans, L Soros, O Witkowski - Advances in Neural …, 2022 - proceedings.neurips.cc
CLIPDraw is an algorithm that synthesizes novel drawings from natural language input. It
does not require any additional training; rather, a pre-trained CLIP language-image encoder …