Visual tuning
Fine-tuning visual models has been widely shown to achieve promising performance on many
downstream visual tasks. With the surprising development of pre-trained visual foundation …
Unified coarse-to-fine alignment for video-text retrieval
The canonical approach to video-text retrieval leverages a coarse-grained or fine-grained
alignment between visual and textual information. However, retrieving the correct video …
Prompt switch: Efficient clip adaptation for text-video retrieval
In text-video retrieval, recent works have benefited from the powerful learning capabilities of
pre-trained text-image foundation models (e.g., CLIP) by adapting them to the video domain …
Mma: Multi-modal adapter for vision-language models
Abstract Pre-trained Vision-Language Models (VLMs) have served as excellent foundation
models for transfer learning in diverse downstream tasks. However, tuning VLMs for few-shot …
Parameter-efficient transfer learning for remote sensing image–text retrieval
Vision-and-language pretraining (VLP) models have experienced a surge in popularity
recently. By fine-tuning them on specific datasets, significant performance improvements …
Few-shot adaptation of multi-modal foundation models: A survey
Abstract Multi-modal (vision-language) models, such as CLIP, are replacing traditional
supervised pre-training models (e.g., ImageNet-based pre-training) as the new generation of …
DGL: Dynamic global-local prompt tuning for text-video retrieval
Text-video retrieval is a critical multi-modal task to find the most relevant video for a text
query. Although pretrained models like CLIP have demonstrated impressive potential in this …
Towards Efficient Audio-Visual Learners via Empowering Pre-trained Vision Transformers with Cross-Modal Adaptation
In this paper, we explore the cross-modal adaptation of pre-trained Vision Transformers
(ViTs) for the audio-visual domain by incorporating a limited set of trainable parameters. To …
Troika: Multi-path cross-modal traction for compositional zero-shot learning
Recent compositional zero-shot learning (CZSL) methods adapt pre-trained vision-language
models (VLMs) by constructing trainable prompts only for composed state-object pairs …
Rap: Efficient text-video retrieval with sparse-and-correlated adapter
Text-Video Retrieval (TVR) aims to align relevant video content with natural language
queries. To date, most state-of-the-art TVR methods learn image-to-video transfer learning …