A survey of multimodal large language models from a data-centric perspective

T Bai, H Liang, B Wan, Y Xu, X Li, S Li, L Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal large language models (MLLMs) enhance the capabilities of standard large
language models by integrating and processing data from multiple modalities, including text …

Contrastive localized language-image pre-training

HY Chen, Z Lai, H Zhang, X Wang, M Eichner… - arXiv preprint arXiv …, 2024 - arxiv.org
Contrastive Language-Image Pre-training (CLIP) has been a celebrated method for training
vision encoders to generate image/text representations facilitating various applications …
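(Aside: a minimal sketch of the symmetric contrastive objective used in CLIP-style pre-training, written in PyTorch. The function name, batch layout, and temperature value are illustrative assumptions, not details drawn from the entry above.)

    import torch
    import torch.nn.functional as F

    def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
        # Normalize embeddings so dot products are cosine similarities.
        image_emb = F.normalize(image_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)
        # Pairwise similarity logits for a batch of N matched image-text pairs.
        logits = image_emb @ text_emb.t() / temperature
        targets = torch.arange(logits.size(0), device=logits.device)
        # Symmetric cross-entropy: match each image to its caption and vice versa.
        loss_i2t = F.cross_entropy(logits, targets)
        loss_t2i = F.cross_entropy(logits.t(), targets)
        return (loss_i2t + loss_t2i) / 2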

Revisit large-scale image-caption data in pre-training multimodal foundation models

Z Lai, V Saveris, C Chen, HY Chen, H Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent advancements in multimodal models highlight the value of rewritten captions for
improving performance, yet key challenges remain. For example, while synthetic captions …

Scaling inference-time search with vision value model for improved visual comprehension

X Wang, Z Yang, L Li, H Lu… - arXiv preprint arXiv …, 2024 - arxiv.org
Despite significant advancements in vision-language models (VLMs), effective approaches for
enhancing response quality by scaling inference-time computation are lacking. This …
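(Aside: one simple way to spend extra inference-time compute with a learned value model is best-of-N selection over sampled responses, sketched below in Python. The helpers generate_candidates and value_model are hypothetical stand-ins; this is a generic illustration, not the specific search procedure proposed in the entry above.)

    def best_of_n(prompt, image, generate_candidates, value_model, n=8):
        # Sample several candidate responses from the VLM, then keep the one
        # the value model scores highest.
        candidates = generate_candidates(prompt, image, num_samples=n)
        scores = [value_model(prompt, image, c) for c in candidates]
        best_idx = max(range(len(candidates)), key=lambda i: scores[i])
        return candidates[best_idx]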

Advancing Myopia To Holism: Fully Contrastive Language-Image Pre-training

H Wang, C Ju, W Lin, S Xiao, M Chen, Y Huang… - arXiv preprint arXiv …, 2024 - arxiv.org
In the rapidly evolving field of vision-language models (VLMs), contrastive language-image pre-
training (CLIP) has made significant strides, becoming the foundation for various downstream …

Detect, describe, discriminate: Moving beyond VQA for MLLM evaluation

M Gaur, M Tapaswi - arXiv preprint arXiv:2409.15125, 2024 - arxiv.org
Visual Question Answering (VQA) with multiple choice questions enables a vision-centric
evaluation of Multimodal Large Language Models (MLLMs). Although it reliably checks the …

Video DataFlywheel: Resolving the Impossible Data Trinity in Video-Language Understanding

X Wang, J Wu, Z Lin, F Zhang… - IEEE Transactions on …, 2025 - ieeexplore.ieee.org
Recently, video-language understanding has achieved great success through large-scale
pre-training. However, data scarcity remains a prevailing challenge. This study quantitatively …

CLIPS: An Enhanced CLIP Framework for Learning with Synthetic Captions

Y Liu, X Li, Z Wang, B Zhao, C Xie - arXiv preprint arXiv:2411.16828, 2024 - arxiv.org
Previous works show that noisy, web-crawled image-text pairs may limit vision-language
pretraining like CLIP and propose learning with synthetic captions as a promising …

STIV: Scalable Text and Image Conditioned Video Generation

Z Lin, W Liu, C Chen, J Lu, W Hu, TJ Fu… - arXiv preprint arXiv …, 2024 - arxiv.org
The field of video generation has made remarkable advancements, yet there remains a
pressing need for a clear, systematic recipe that can guide the development of robust and …

TIPS: Text-Image Pretraining with Spatial Awareness

KK Maninis, K Chen, S Ghosh, A Karpur… - arXiv preprint arXiv …, 2024 - arxiv.org
While image-text representation learning has become very popular in recent years, existing
models tend to lack spatial awareness and have limited direct applicability for dense …