OmniGen: Unified image generation

S Xiao, Y Wang, J Zhou, H Yuan, X Xing, R Yan… - arXiv preprint arXiv …, 2024 - arxiv.org
In this work, we introduce OmniGen, a new diffusion model for unified image generation. Unlike popular diffusion models (e.g., Stable Diffusion), OmniGen no longer requires …

BRIDGE: Bridging Gaps in Image Captioning Evaluation with Stronger Visual Cues

S Sarto, M Cornia, L Baraldi, R Cucchiara - European Conference on …, 2024 - Springer
Effectively aligning with human judgment when evaluating machine-generated image
captions represents a complex yet intriguing challenge. Existing evaluation metrics like …

Contrastive localized language-image pre-training

HY Chen, Z Lai, H Zhang, X Wang, M Eichner… - arXiv preprint arXiv …, 2024 - arxiv.org
Contrastive Language-Image Pre-training (CLIP) has been a celebrated method for training
vision encoders to generate image/text representations facilitating various applications …
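
For intuition about the contrastive pre-training that this and the following CLIP-style entries refer to, here is a minimal sketch of the standard symmetric InfoNCE objective from the original CLIP. It is not this paper's localized variant; the function name and toy tensors are illustrative assumptions.

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_feats, text_feats, temperature=0.07):
    # L2-normalize both embedding sets so dot products are cosine similarities.
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    # Pairwise similarity logits for a batch of N aligned image-text pairs.
    logits = image_feats @ text_feats.t() / temperature
    # Matched pairs lie on the diagonal; score each row/column as a classification.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)        # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)    # text -> image direction
    return (loss_i2t + loss_t2i) / 2

# Toy usage: batch of 8 pairs with random 512-d features.
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))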

LLM2CLIP: Powerful language model unlocks richer visual representation

W Huang, A Wu, Y Yang, X Luo, Y Yang, L Hu… - arXiv preprint arXiv …, 2024 - arxiv.org
CLIP is one of the most important multimodal foundational models today. What powers
CLIP's capabilities? The rich supervision signals provided by natural language, the carrier of …

UniMed-CLIP: Towards a unified image-text pretraining paradigm for diverse medical imaging modalities

MU Khattak, S Kunhimon, M Naseer, S Khan… - arXiv preprint arXiv …, 2024 - arxiv.org
Vision-Language Models (VLMs) trained via contrastive learning have achieved notable
success in natural image tasks. However, their application in the medical domain remains …

Revisit large-scale image-caption data in pre-training multimodal foundation models

Z Lai, V Saveris, C Chen, HY Chen, H Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent advancements in multimodal models highlight the value of rewritten captions for
improving performance, yet key challenges remain. For example, while synthetic captions …

Dual diffusion for unified image generation and understanding

Z Li, H Li, Y Shi, AB Farimani, Y Kluger, L Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
Diffusion models have achieved tremendous success in text-to-image generation, yet they still lag behind in visual understanding tasks, an area dominated by autoregressive vision …

Active data curation effectively distills large-scale multimodal models

V Udandarao, N Parthasarathy, MF Naeem… - arXiv preprint arXiv …, 2024 - arxiv.org
Knowledge distillation (KD) is the de facto standard for compressing large-scale models into
smaller ones. Prior works have explored ever more complex KD strategies involving different …
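
As context for the snippet above, this is a minimal sketch of the generic soft-target knowledge distillation loss (Hinton et al.) that the "de facto standard" refers to; it is the baseline, not this paper's active-curation method, and the names and temperature are illustrative.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soften both distributions with a temperature, then pull the student
    # toward the teacher via KL divergence, scaled by T^2 per Hinton et al.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * temperature ** 2

# Toy usage: 8 examples, 10 classes, random logits for both models.
loss = distillation_loss(torch.randn(8, 10), torch.randn(8, 10))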

LHRS-Bot-Nova: Improved multimodal large language model for remote sensing vision-language interpretation

Z Li, D Muhtar, F Gu, X Zhang, P Xiao, G He… - arXiv preprint arXiv …, 2024 - arxiv.org
Automatically and rapidly understanding Earth's surface is fundamental to our grasp of the
living environment and informed decision-making. This underscores the need for a unified …

CLIP-MoE: Towards building mixture of experts for CLIP with diversified multiplet upcycling

J Zhang, X Qu, T Zhu, Y Cheng - arXiv preprint arXiv:2409.19291, 2024 - arxiv.org
In recent years, Contrastive Language-Image Pre-training (CLIP) has become a cornerstone
in multimodal intelligence. However, recent studies have identified that the information loss …
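
For readers unfamiliar with the mixture-of-experts idea named in this title, here is a generic top-k MoE layer for intuition only; it is not CLIP-MoE's diversified multiplet upcycling, and the class name, expert count, and dimensions are made-up placeholders.

import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    # A router scores every expert per input; only the top-k experts run,
    # and their outputs are mixed with renormalized router weights.
    def __init__(self, dim, num_experts=4, k=2):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_experts)])
        self.router = nn.Linear(dim, num_experts)
        self.k = k

    def forward(self, x):  # x: (batch, dim)
        weights = self.router(x).softmax(dim=-1)
        topw, topi = weights.topk(self.k, dim=-1)
        topw = topw / topw.sum(dim=-1, keepdim=True)  # renormalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topi[:, slot] == e
                if mask.any():
                    out[mask] += topw[mask, slot].unsqueeze(1) * expert(x[mask])
        return out

# Toy usage: batch of 8 tokens with 32-d features.
y = TopKMoE(32)(torch.randn(8, 32))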