A survey of multimodal large language models from a data-centric perspective

T Bai, H Liang, B Wan, Y Xu, X Li, S Li, L Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal large language models (MLLMs) enhance the capabilities of standard large
language models by integrating and processing data from multiple modalities, including text …

Contrastive localized language-image pre-training

HY Chen, Z Lai, H Zhang, X Wang, M Eichner… - arXiv preprint arXiv …, 2024 - arxiv.org
Contrastive Language-Image Pre-training (CLIP) has been a celebrated method for training
vision encoders to generate image/text representations facilitating various applications …
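(Aside: a minimal sketch of the symmetric contrastive objective used in CLIP-style pre-training, written in PyTorch. The function name, batch layout, and temperature value are illustrative assumptions, not details drawn from the entry above.)

    import torch
    import torch.nn.functional as F

    def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
        # Normalize embeddings so dot products are cosine similarities.
        image_emb = F.normalize(image_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)
        # Pairwise similarity logits for a batch of N matched image-text pairs.
        logits = image_emb @ text_emb.t() / temperature
        targets = torch.arange(logits.size(0), device=logits.device)
        # Symmetric cross-entropy: match each image to its caption and vice versa.
        loss_i2t = F.cross_entropy(logits, targets)
        loss_t2i = F.cross_entropy(logits.t(), targets)
        return (loss_i2t + loss_t2i) / 2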

Revisit large-scale image-caption data in pre-training multimodal foundation models

Z Lai, V Saveris, C Chen, HY Chen, H Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent advancements in multimodal models highlight the value of rewritten captions for
improving performance, yet key challenges remain. For example, while synthetic captions …

Scaling inference-time search with vision value model for improved visual comprehension

X Wang, Z Yang, L Li, H Lu… - arXiv preprint arXiv …, 2024 - arxiv.org
Despite significant advancements in vision-language models (VLMs), effective approaches for
enhancing response quality by scaling inference-time computation are lacking. This …
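(Aside: one simple way to spend extra inference-time compute with a learned value model is best-of-N selection over sampled responses, sketched below in Python. The helpers generate_candidates and value_model are hypothetical stand-ins; this is a generic illustration, not the specific search procedure proposed in the entry above.)

    def best_of_n(prompt, image, generate_candidates, value_model, n=8):
        # Sample several candidate responses from the VLM, then keep the one
        # the value model scores highest.
        candidates = generate_candidates(prompt, image, num_samples=n)
        scores = [value_model(prompt, image, c) for c in candidates]
        best_idx = max(range(len(candidates)), key=lambda i: scores[i])
        return candidates[best_idx]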

Advancing Myopia To Holism: Fully Contrastive Language-Image Pre-training

H Wang, C Ju, W Lin, S Xiao, M Chen, Y Huang… - arXiv preprint arXiv …, 2024 - arxiv.org
In the rapidly evolving field of vision-language models (VLMs), contrastive language-image pre-
training (CLIP) has made significant strides, becoming the foundation for various downstream …

Detect, describe, discriminate: Moving beyond VQA for MLLM evaluation

M Gaur, M Tapaswi - arXiv preprint arXiv:2409.15125, 2024 - arxiv.org
Visual Question Answering (VQA) with multiple choice questions enables a vision-centric
evaluation of Multimodal Large Language Models (MLLMs). Although it reliably checks the …

Video DataFlywheel: Resolving the Impossible Data Trinity in Video-Language Understanding

X Wang, J Wu, Z Lin, F Zhang… - IEEE Transactions on …, 2025 - ieeexplore.ieee.org
Recently, video-language understanding has achieved great success through large-scale
pre-training. However, data scarcity remains a prevailing challenge. This study quantitatively …

CLIPS: An Enhanced CLIP Framework for Learning with Synthetic Captions

Y Liu, X Li, Z Wang, B Zhao, C Xie - arXiv preprint arXiv:2411.16828, 2024 - arxiv.org
Previous works show that noisy, web-crawled image-text pairs may limit vision-language
pretraining like CLIP and propose learning with synthetic captions as a promising …

STIV: Scalable Text and Image Conditioned Video Generation

Z Lin, W Liu, C Chen, J Lu, W Hu, TJ Fu… - arXiv preprint arXiv …, 2024 - arxiv.org
The field of video generation has made remarkable advancements, yet there remains a
pressing need for a clear, systematic recipe that can guide the development of robust and …

TIPS: Text-Image Pretraining with Spatial Awareness

KK Maninis, K Chen, S Ghosh, A Karpur… - arXiv preprint arXiv …, 2024 - arxiv.org
While image-text representation learning has become very popular in recent years, existing
models tend to lack spatial awareness and have limited direct applicability for dense …