Glyph-ByT5: A customized text encoder for accurate visual text rendering

Z Liu, W Liang, Z Liang, C Luo, J Li, G Huang… - … on Computer Vision, 2024 - Springer
Visual text rendering poses a fundamental challenge for contemporary text-to-image
generation models, with the core problem lying in text encoder deficiencies. To achieve …

Contrastive localized language-image pre-training

HY Chen, Z Lai, H Zhang, X Wang, M Eichner… - arXiv preprint arXiv …, 2024 - arxiv.org
Contrastive Language-Image Pre-training (CLIP) has been a celebrated method for training
vision encoders to generate image/text representations facilitating various applications …

ProtCLIP: Function-informed protein multi-modal learning

H Zhou, M Yin, W Wu, M Li, K Fu, J Chen, J Wu… - arXiv preprint arXiv …, 2024 - arxiv.org
Multi-modality pre-training paradigm that aligns protein sequences and biological
descriptions has learned general protein representations and achieved promising …

Beyond Sight: Towards Cognitive Alignment in LVLM via Enriched Visual Knowledge

Y Zhao, Y Yin, L Li, M Lin, VSJ Huang, S Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
Does seeing always mean knowing? Large Vision-Language Models (LVLMs) integrate
separately pre-trained vision and language components, often using CLIP-ViT as vision …

Preserving Multi-Modal Capabilities of Pre-trained VLMs for Improving Vision-Linguistic Compositionality

Y Oh, JW Cho, DJ Kim, IS Kweon, J Kim - arXiv preprint arXiv:2410.05210, 2024 - arxiv.org
In this paper, we propose a new method to enhance compositional understanding in pre-
trained vision and language models (VLMs) without sacrificing performance in zero-shot …

softmax is not enough (for sharp out-of-distribution)

P Veličković, C Perivolaropoulos, F Barbero… - arXiv preprint arXiv …, 2024 - arxiv.org
A key property of reasoning systems is the ability to make sharp decisions on their input
data. For contemporary AI systems, a key carrier of sharp behaviour is the softmax function …

Towards Real Zero-Shot Camouflaged Object Segmentation without Camouflaged Annotations

C Lei, J Fan, X Li, T Xiang, A Li, C Zhu… - arXiv preprint arXiv …, 2024 - arxiv.org
Camouflaged Object Segmentation (COS) faces significant challenges due to the scarcity of
annotated data, where meticulous pixel-level annotation is both labor-intensive and costly …

From Unimodal to Multimodal: Scaling up Projectors to Align Modalities

M Maniparambil, R Akshulakov, YAD Djilali… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent contrastive multimodal vision-language models like CLIP have demonstrated robust
open-world semantic understanding, becoming the standard image backbones for vision …

A multimodal similarity-aware and knowledge-driven pre-training approach for reliable pneumoconiosis diagnosis

X Ren, G Ji, S Chu, S Yoshida, J Zhao… - Journal of X-Ray …, 2025 - journals.sagepub.com
Background Pneumoconiosis staging is challenging due to the low clarity of X-ray images
and the small, diffuse nature of the lesions. Additionally, the scarcity of annotated data …

RSITR-FFT: Efficient Fine-Grained Fine-Tuning Framework with Consistency Regularization for Remote Sensing Image Text Retrieval

D Xiu, L Ji, X Geng, Y Wu - IEEE Geoscience and Remote …, 2024 - ieeexplore.ieee.org
Vision-language models have demonstrated impressive capabilities in associating images
and text by pretraining on extensive image-text paired data. The paradigm of continual …