OmniGen: Unified image generation

S Xiao, Y Wang, J Zhou, H Yuan, X Xing, R Yan… - arXiv preprint arXiv …, 2024 - arxiv.org
In this work, we introduce OmniGen, a new diffusion model for unified image generation. Unlike popular diffusion models (e.g., Stable Diffusion), OmniGen no longer requires …

BRIDGE: Bridging Gaps in Image Captioning Evaluation with Stronger Visual Cues

S Sarto, M Cornia, L Baraldi, R Cucchiara - European Conference on …, 2024 - Springer
Effectively aligning with human judgment when evaluating machine-generated image
captions represents a complex yet intriguing challenge. Existing evaluation metrics like …

Contrastive localized language-image pre-training

HY Chen, Z Lai, H Zhang, X Wang, M Eichner… - arXiv preprint arXiv …, 2024 - arxiv.org
Contrastive Language-Image Pre-training (CLIP) has been a celebrated method for training
vision encoders to generate image/text representations facilitating various applications …
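
For intuition about the contrastive pre-training that this and the following CLIP-style entries refer to, here is a minimal sketch of the standard symmetric InfoNCE objective from the original CLIP. It is not this paper's localized variant; the function name and toy tensors are illustrative assumptions.

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_feats, text_feats, temperature=0.07):
    # L2-normalize both embedding sets so dot products are cosine similarities.
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    # Pairwise similarity logits for a batch of N aligned image-text pairs.
    logits = image_feats @ text_feats.t() / temperature
    # Matched pairs lie on the diagonal; score each row/column as a classification.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)        # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)    # text -> image direction
    return (loss_i2t + loss_t2i) / 2

# Toy usage: batch of 8 pairs with random 512-d features.
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))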

LLM2CLIP: Powerful language model unlocks richer visual representation

W Huang, A Wu, Y Yang, X Luo, Y Yang, L Hu… - arXiv preprint arXiv …, 2024 - arxiv.org
CLIP is one of the most important multimodal foundational models today. What powers
CLIP's capabilities? The rich supervision signals provided by natural language, the carrier of …

UniMed-CLIP: Towards a unified image-text pretraining paradigm for diverse medical imaging modalities

MU Khattak, S Kunhimon, M Naseer, S Khan… - arXiv preprint arXiv …, 2024 - arxiv.org
Vision-Language Models (VLMs) trained via contrastive learning have achieved notable
success in natural image tasks. However, their application in the medical domain remains …

Revisit large-scale image-caption data in pre-training multimodal foundation models

Z Lai, V Saveris, C Chen, HY Chen, H Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent advancements in multimodal models highlight the value of rewritten captions for
improving performance, yet key challenges remain. For example, while synthetic captions …

Dual diffusion for unified image generation and understanding

Z Li, H Li, Y Shi, AB Farimani, Y Kluger, L Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
Diffusion models have achieved tremendous success in text-to-image generation, yet they still lag behind in visual understanding tasks, an area dominated by autoregressive vision …

Active data curation effectively distills large-scale multimodal models

V Udandarao, N Parthasarathy, MF Naeem… - arXiv preprint arXiv …, 2024 - arxiv.org
Knowledge distillation (KD) is the de facto standard for compressing large-scale models into
smaller ones. Prior works have explored ever more complex KD strategies involving different …
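
As context for the snippet above, this is a minimal sketch of the generic soft-target knowledge distillation loss (Hinton et al.) that the "de facto standard" refers to; it is the baseline, not this paper's active-curation method, and the names and temperature are illustrative.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soften both distributions with a temperature, then pull the student
    # toward the teacher via KL divergence, scaled by T^2 per Hinton et al.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * temperature ** 2

# Toy usage: 8 examples, 10 classes, random logits for both models.
loss = distillation_loss(torch.randn(8, 10), torch.randn(8, 10))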

LHRS-Bot-Nova: Improved multimodal large language model for remote sensing vision-language interpretation

Z Li, D Muhtar, F Gu, X Zhang, P Xiao, G He… - arXiv preprint arXiv …, 2024 - arxiv.org
Automatically and rapidly understanding Earth's surface is fundamental to our grasp of the
living environment and informed decision-making. This underscores the need for a unified …

CLIP-MoE: Towards building mixture of experts for CLIP with diversified multiplet upcycling

J Zhang, X Qu, T Zhu, Y Cheng - arXiv preprint arXiv:2409.19291, 2024 - arxiv.org
In recent years, Contrastive Language-Image Pre-training (CLIP) has become a cornerstone
in multimodal intelligence. However, recent studies have identified that the information loss …
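
For readers unfamiliar with the mixture-of-experts idea named in this title, here is a generic top-k MoE layer for intuition only; it is not CLIP-MoE's diversified multiplet upcycling, and the class name, expert count, and dimensions are made-up placeholders.

import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    # A router scores every expert per input; only the top-k experts run,
    # and their outputs are mixed with renormalized router weights.
    def __init__(self, dim, num_experts=4, k=2):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_experts)])
        self.router = nn.Linear(dim, num_experts)
        self.k = k

    def forward(self, x):  # x: (batch, dim)
        weights = self.router(x).softmax(dim=-1)
        topw, topi = weights.topk(self.k, dim=-1)
        topw = topw / topw.sum(dim=-1, keepdim=True)  # renormalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topi[:, slot] == e
                if mask.any():
                    out[mask] += topw[mask, slot].unsqueeze(1) * expert(x[mask])
        return out

# Toy usage: batch of 8 tokens with 32-d features.
y = TopKMoE(32)(torch.randn(8, 32))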