Glyph-ByT5: A customized text encoder for accurate visual text rendering
Visual text rendering poses a fundamental challenge for contemporary text-to-image
generation models, with the core problem lying in text encoder deficiencies. To achieve …
Contrastive localized language-image pre-training
Contrastive Language-Image Pre-training (CLIP) has been a celebrated method for training
vision encoders to generate image/text representations facilitating various applications …
ProtCLIP: Function-informed protein multi-modal learning
Multi-modality pre-training paradigm that aligns protein sequences and biological
descriptions has learned general protein representations and achieved promising …
Beyond Sight: Towards Cognitive Alignment in LVLM via Enriched Visual Knowledge
Does seeing always mean knowing? Large Vision-Language Models (LVLMs) integrate
separately pre-trained vision and language components, often using CLIP-ViT as vision …
Preserving Multi-Modal Capabilities of Pre-trained VLMs for Improving Vision-Linguistic Compositionality
In this paper, we propose a new method to enhance compositional understanding in pre-
trained vision and language models (VLMs) without sacrificing performance in zero-shot …
softmax is not enough (for sharp out-of-distribution)
A key property of reasoning systems is the ability to make sharp decisions on their input
data. For contemporary AI systems, a key carrier of sharp behaviour is the softmax function …
Towards Real Zero-Shot Camouflaged Object Segmentation without Camouflaged Annotations
Camouflaged Object Segmentation (COS) faces significant challenges due to the scarcity of
annotated data, where meticulous pixel-level annotation is both labor-intensive and costly …
From Unimodal to Multimodal: Scaling up Projectors to Align Modalities
Recent contrastive multimodal vision-language models like CLIP have demonstrated robust
open-world semantic understanding, becoming the standard image backbones for vision …
A multimodal similarity-aware and knowledge-driven pre-training approach for reliable pneumoconiosis diagnosis
X Ren, G Ji, S Chu, S Yoshida, J Zhao… - Journal of X-Ray …, 2025 - journals.sagepub.com
Background Pneumoconiosis staging is challenging due to the low clarity of X-ray images
and the small, diffuse nature of the lesions. Additionally, the scarcity of annotated data …
RSITR-FFT: Efficient Fine-Grained Fine-Tuning Framework with Consistency Regularization for Remote Sensing Image Text Retrieval
D **u, L Ji, X Geng, Y Wu - IEEE Geoscience and Remote …, 2024 - ieeexplore.ieee.org
Vision-language models have demonstrated impressive capabilities in associating images
and text by pretraining on extensive image-text paired data. The paradigm of continual …