Salience-based adaptive masking: revisiting token dynamics for enhanced pre-training
In this paper, we introduce Saliency-Based Adaptive Masking (SBAM), a novel and cost-
effective approach that significantly enhances the pre-training performance of Masked …
CLIP-KD: An Empirical Study of Distilling CLIP Models
CLIP has become a promising language-supervised visual pre-training framework and
achieves excellent performance over a wide range of tasks. This paper aims to distill small …
DetailCLIP: Detail-oriented CLIP for fine-grained tasks
In this paper, we introduce DetailCLIP: A Detail-Oriented CLIP to address the limitations of
contrastive learning-based vision-language models, particularly CLIP, in handling detail …
UMG-CLIP: a unified multi-granularity vision generalist for open-world understanding
Vision-language foundation models, represented by Contrastive Language-Image Pre-
training (CLIP), have gained increasing attention for jointly understanding both vision and …
Histopathology language-image representation learning for fine-grained digital pathology cross-modal retrieval
Large-scale digital whole slide image (WSI) dataset analysis has gained significant
attention in computer-aided cancer diagnosis. Content-based histopathological image …
CLIP-KD: An Empirical Study of CLIP Model Distillation
Contrastive Language-Image Pre-training (CLIP) has become a promising
language-supervised visual pre-training framework. This paper aims to distill small CLIP …
Human Pose Descriptions and Subject-Focused Attention for Improved Zero-Shot Transfer in Human-Centric Classification Tasks
We present a novel LLM-based pipeline for creating contextual descriptions of human body
poses in images using only auxiliary attributes. This approach facilitates the creation of the …
Text-Driven Traffic Anomaly Detection with Temporal High-Frequency Modeling in Driving Videos
Traffic anomaly detection (TAD) in driving videos is critical for ensuring the safety of
autonomous driving and advanced driver assistance systems. Previous single-stage TAD …
Task-Oriented Multi-Modal Mutual Leaning for Vision-Language Models
Prompt learning has become one of the most efficient paradigms for adapting large pre-
trained vision-language models to downstream tasks. Current state-of-the-art methods, like …
PCQA: A strong baseline for AIGC quality assessment based on prompt condition
The development of Large Language Models (LLM) and Diffusion Models brings the boom
of Artificial Intelligence Generated Content (AIGC). It is essential to build an effective quality …