Visual tuning

BXB Yu, J Chang, H Wang, L Liu, S Wang… - ACM Computing …, 2024 - dl.acm.org
Fine-tuning visual models has been widely shown to achieve promising performance on many
downstream visual tasks. With the remarkable development of pre-trained visual foundation …

Unified coarse-to-fine alignment for video-text retrieval

Z Wang, YL Sung, F Cheng… - Proceedings of the …, 2023 - openaccess.thecvf.com
The canonical approach to video-text retrieval leverages a coarse-grained or fine-grained
alignment between visual and textual information. However, retrieving the correct video …

Prompt switch: Efficient CLIP adaptation for text-video retrieval

C Deng, Q Chen, P Qin, D Chen… - Proceedings of the …, 2023 - openaccess.thecvf.com
In text-video retrieval, recent works have benefited from the powerful learning capabilities of
pre-trained text-image foundation models (e.g., CLIP) by adapting them to the video domain …

MMA: Multi-modal adapter for vision-language models

L Yang, RY Zhang, Y Wang… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Pre-trained Vision-Language Models (VLMs) have served as excellent foundation
models for transfer learning in diverse downstream tasks. However, tuning VLMs for few-shot …

Parameter-efficient transfer learning for remote sensing image–text retrieval

Y Yuan, Y Zhan, Z Xiong - IEEE Transactions on Geoscience …, 2023 - ieeexplore.ieee.org
Vision-and-language pretraining (VLP) models have experienced a surge in popularity
recently. By fine-tuning them on specific datasets, significant performance improvements …

Few-shot adaptation of multi-modal foundation models: A survey

F Liu, T Zhang, W Dai, C Zhang, W Cai, X Zhou… - Artificial Intelligence …, 2024 - Springer
Multi-modal (vision-language) models, such as CLIP, are replacing traditional
supervised pre-training models (e.g., ImageNet-based pre-training) as the new generation of …

DGL: Dynamic global-local prompt tuning for text-video retrieval

X Yang, L Zhu, X Wang, Y Yang - … of the AAAI Conference on Artificial …, 2024 - ojs.aaai.org
Text-video retrieval is a critical multi-modal task that aims to find the most relevant video for a
text query. Although pretrained models like CLIP have demonstrated impressive potential in this …

Towards Efficient Audio-Visual Learners via Empowering Pre-trained Vision Transformers with Cross-Modal Adaptation

K Wang, Y Tian, D Hatzinakos - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
In this paper, we explore the cross-modal adaptation of pre-trained Vision Transformers
(ViTs) for the audio-visual domain by incorporating a limited set of trainable parameters. To …

Troika: Multi-path cross-modal traction for compositional zero-shot learning

S Huang, B Gong, Y Feng, M Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Recent compositional zero-shot learning (CZSL) methods adapt pre-trained vision-language
models (VLMs) by constructing trainable prompts only for composed state-object pairs …

RAP: Efficient text-video retrieval with sparse-and-correlated adapter

M Cao, H Tang, J Huang, P Jin, C Zhang, R Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
Text-Video Retrieval (TVR) aims to align relevant video content with natural language
queries. To date, most state-of-the-art TVR methods learn image-to-video transfer learning …