VeCLIP: Improving CLIP training via visual-enriched captions

Z Lai, H Zhang, B Zhang, W Wu, H Bai… - … on Computer Vision, 2024 - Springer
Large-scale web-crawled datasets are fundamental for the success of pre-training vision-
language models, such as CLIP. However, the inherent noise and potential irrelevance of …

No" zero-shot" without exponential data: Pretraining concept frequency determines multimodal model performance

V Udandarao, A Prabhu, A Ghosh… - The Thirty-eighth …, 2024 - openreview.net
Web-crawled pretraining datasets underlie the impressive "zero-shot" evaluation
performance of multimodal models, such as CLIP for classification and Stable-Diffusion for …

Scaling Laws for Data Filtering--Data Curation cannot be Compute Agnostic

S Goyal, P Maini, ZC Lipton… - Proceedings of the …, 2024 - openaccess.thecvf.com
Vision-language models (VLMs) are trained for thousands of GPU hours on carefully
selected subsets of massive web scrapes. For instance, the LAION public dataset retained …

Sieve: Multimodal dataset pruning using image captioning models

A Mahmoud, M Elhoushi, A Abbas… - Proceedings of the …, 2024 - openaccess.thecvf.com
Vision-Language Models (VLMs) are pretrained on large, diverse, and noisy web-
crawled datasets. This underscores the critical need for dataset pruning as the quality of …

From scarcity to efficiency: Improving CLIP training via visual-enriched captions

Z Lai, H Zhang, W Wu, H Bai, A Timofeev, X Du, Z Gan… - 2023 - openreview.net
Web-crawled datasets are pivotal to the success of pre-training vision-language models,
exemplified by CLIP. However, web-crawled AltTexts can be noisy and potentially irrelevant …

HYPE: Hyperbolic entailment filtering for underspecified images and texts

W Kim, S Chun, T Kim, D Han, S Yun - European Conference on Computer …, 2024 - Springer
In an era where the volume of data drives the effectiveness of self-supervised learning, the
specificity and clarity of data semantics play a crucial role in model training. Addressing this …

Rephrasing the web: A recipe for compute and data-efficient language modeling

P Maini, S Seto, H Bai, D Grangier, Y Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models are trained on massive scrapes of the web, which are often
unstructured, noisy, and poorly phrased. Current scaling laws show that learning from such …

Parrot captions teach CLIP to spot text

Y Lin, C He, AJ Wang, B Wang, W Li… - European Conference on …, 2024 - Springer
Despite being the foundation model in numerous vision-language applications, CLIP
suffers from a severe text spotting bias. Such bias causes CLIP models to 'Parrot' the visual …

An introduction to vision-language modeling

F Bordes, RY Pang, A Ajay, AC Li, A Bardes… - arXiv preprint arXiv …, 2024 - arxiv.org
Following the recent popularity of Large Language Models (LLMs), several attempts have
been made to extend them to the visual domain. From having a visual assistant that could …

The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective

Z Qin, D Chen, W Zhang, L Yao, Y Huang… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent years have witnessed the rapid development of large language models (LLMs).
Building on these powerful LLMs, multi-modal LLMs (MLLMs) extend the modality from …