CLIP in medical imaging: A comprehensive survey

Z Zhao, Y Liu, H Wu, M Wang, Y Li, S Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
Contrastive Language-Image Pre-training (CLIP), a simple yet effective pre-training
paradigm, successfully introduces text supervision to vision models. It has shown promising …

Cross-modal retrieval: a systematic review of methods and future directions

T Wang, F Li, L Zhu, J Li, Z Zhang… - Proceedings of the …, 2025 - ieeexplore.ieee.org
With the exponential surge in diverse multimodal data, traditional unimodal retrieval
methods struggle to meet the needs of users seeking access to data across various …

A survey on RAG meeting LLMs: Towards retrieval-augmented large language models

W Fan, Y Ding, L Ning, S Wang, H Li, D Yin… - Proceedings of the 30th …, 2024 - dl.acm.org
As one of the most advanced techniques in AI, Retrieval-Augmented Generation (RAG) can
offer reliable and up-to-date external knowledge, providing huge convenience for numerous …

Retrieval-augmented generation for AI-generated content: A survey

P Zhao, H Zhang, Q Yu, Z Wang, Y Geng, F Fu… - arXiv preprint arXiv …, 2024 - arxiv.org
The development of Artificial Intelligence Generated Content (AIGC) has been facilitated by
advancements in model algorithms, scalable foundation model architectures, and the …

Retrieval-augmented multimodal language modeling

M Yasunaga, A Aghajanyan, W Shi, R James… - arXiv preprint arXiv …, 2022 - arxiv.org
Recent multimodal models such as DALL-E and CM3 have achieved remarkable progress
in text-to-image and image-to-text generation. However, these models store all learned …

Wiki-LLaVA: Hierarchical retrieval-augmented generation for multimodal LLMs

D Caffagni, F Cocchi, N Moratelli… - Proceedings of the …, 2024 - openaccess.thecvf.com
Multimodal LLMs are the natural evolution of LLMs and enlarge their capabilities so as to
work beyond the pure textual modality. As research is being carried out to design novel …

MeaCap: Memory-augmented zero-shot image captioning

Z Zeng, Y **e, H Zhang, C Chen… - Proceedings of the …, 2024 - openaccess.thecvf.com
Zero-shot image captioning (IC) without well-paired image-text data can be categorized into
two main types: training-free and text-only-training methods. While both types integrate pre …

Transferable decoding with visual entities for zero-shot image captioning

J Fei, T Wang, J Zhang, Z He… - Proceedings of the …, 2023 - openaccess.thecvf.com
Image-to-text generation aims to describe images using natural language. Recently, zero-
shot image captioning based on pre-trained vision-language models (VLMs) and large …

Visual-augmented dynamic semantic prototype for generative zero-shot learning

W Hou, S Chen, S Chen, Z Hong… - Proceedings of the …, 2024 - openaccess.thecvf.com
Generative zero-shot learning (ZSL) learns a generator to synthesize visual samples for
unseen classes, which is an effective way to advance ZSL. However, existing generative …

Exploring diverse in-context configurations for image captioning

X Yang, Y Wu, M Yang, H Chen… - Advances in Neural …, 2023 - proceedings.neurips.cc
After discovering that Language Models (LMs) can be good in-context few-shot
learners, numerous strategies have been proposed to optimize in-context sequence …