Salience-Based Adaptive Masking: Revisiting Token Dynamics for Enhanced Pre-training

H Choi, H Park, KM Yi, S Cha, D Min - European Conference on Computer …, 2024 - Springer
In this paper, we introduce Saliency-Based Adaptive Masking (SBAM), a novel and cost-
effective approach that significantly enhances the pre-training performance of Masked …

CLIP-KD: An Empirical Study of Distilling CLIP Models

C Yang, Z An, L Huang, J Bi, X Yu, H Yang… - arxiv preprint arxiv …, 2023 - arxiv.org
CLIP has become a promising language-supervised visual pre-training framework and
achieves excellent performance over a wide range of tasks. This paper aims to distill small …

DetailCLIP: Detail-Oriented CLIP for Fine-Grained Tasks

AK Monsefi, KP Sailaja, A Alilooee, SN Lim… - arxiv preprint arxiv …, 2024 - arxiv.org
In this paper, we introduce DetailCLIP: A Detail-Oriented CLIP to address the limitations of
contrastive learning-based vision-language models, particularly CLIP, in handling detail …

UMG-CLIP: A Unified Multi-Granularity Vision Generalist for Open-World Understanding

B Shi, P Zhao, Z Wang, Y Zhang, Y Wang, J Li… - … on Computer Vision, 2024 - Springer
Vision-language foundation models, represented by Contrastive Language-Image Pre-training (CLIP), have gained increasing attention for jointly understanding both vision and …

Histopathology language-image representation learning for fine-grained digital pathology cross-modal retrieval

D Hu, Z Jiang, J Shi, F Xie, K Wu, K Tang, M Cao… - Medical Image …, 2024 - Elsevier
Analysis of large-scale digital whole slide image (WSI) datasets has gained significant
attention in computer-aided cancer diagnosis. Content-based histopathological image …

CLIP-KD: An Empirical Study of CLIP Model Distillation

C Yang, Z An, L Huang, J Bi, X Yu… - Proceedings of the …, 2024 - openaccess.thecvf.com
Abstract Contrastive Language-Image Pre-training (CLIP) has become a promising
language-supervised visual pre-training framework. This paper aims to distill small CLIP …

Human Pose Descriptions and Subject-Focused Attention for Improved Zero-Shot Transfer in Human-Centric Classification Tasks

MSU Khan, MF Naeem, F Tombari, L Van Gool… - arxiv preprint arxiv …, 2024 - arxiv.org
We present a novel LLM-based pipeline for creating contextual descriptions of human body
poses in images using only auxiliary attributes. This approach facilitates the creation of the …

Text-Driven Traffic Anomaly Detection with Temporal High-Frequency Modeling in Driving Videos

R Liang, Y Li, J Zhou, X Li - … on Circuits and Systems for Video …, 2024 - ieeexplore.ieee.org
Traffic anomaly detection (TAD) in driving videos is critical for ensuring the safety of
autonomous driving and advanced driver assistance systems. Previous single-stage TAD …

Task-Oriented Multi-Modal Mutual Learning for Vision-Language Models

S Long, Z Zhao, J Yuan, Z Tan, J Liu… - Proceedings of the …, 2023 - openaccess.thecvf.com
Prompt learning has become one of the most efficient paradigms for adapting large pre-
trained vision-language models to downstream tasks. Current state-of-the-art methods, like …

PCQA: A Strong Baseline for AIGC Quality Assessment Based on Prompt Condition

X Fang, W Wang, X Lv, J Yan - arxiv preprint arxiv:2404.13299, 2024 - arxiv.org
The development of Large Language Models (LLMs) and Diffusion Models has fueled the boom
of Artificial Intelligence Generated Content (AIGC). It is essential to build an effective quality …