Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

Z Chen, J Wu, W Wang, W Su, G Chen… - Proceedings of the …, 2024 - openaccess.thecvf.com
The exponential growth of large language models (LLMs) has opened up numerous
possibilities for multi-modal AGI systems. However, the progress in vision and vision …

LAION-5B: An open large-scale dataset for training next generation image-text models

C Schuhmann, R Beaumont, R Vencu… - Advances in neural …, 2022 - proceedings.neurips.cc
Groundbreaking language-vision architectures like CLIP and DALL-E proved the utility of
training on large amounts of noisy image-text data, without relying on expensive accurate …

Chinese CLIP: Contrastive vision-language pretraining in Chinese

A Yang, J Pan, J Lin, R Men, Y Zhang, J Zhou… - arXiv preprint arXiv …, 2022 - arxiv.org
The tremendous success of CLIP (Radford et al., 2021) has promoted the research and
application of contrastive learning for vision-language pretraining. In this work, we construct …

AltCLIP: Altering the language encoder in CLIP for extended language capabilities

Z Chen, G Liu, BW Zhang, F Ye, Q Yang… - arXiv preprint arXiv …, 2022 - arxiv.org
In this work, we present a conceptually simple and effective method to train a strong
bilingual/multilingual multimodal representation model. Starting from the pre-trained …

Large multilingual models pivot zero-shot multimodal learning across languages

J Hu, Y Yao, C Wang, S Wang, Y Pan, Q Chen… - arXiv preprint arXiv …, 2023 - arxiv.org
Recently, there has been a significant surge in multimodal learning in terms of both image-to-
text and text-to-image generation. However, the success is typically limited to English …

On the power of foundation models

Y Yuan - International Conference on Machine Learning, 2023 - proceedings.mlr.press
With infinitely many high-quality data points, infinite computational power, an infinitely large
foundation model with a perfect training algorithm and guaranteed zero generalization error …

Multilingual diversity improves vision-language representations

T Nguyen, M Wallingford, S Santy… - Advances in …, 2025 - proceedings.neurips.cc
Massive web-crawled image-text datasets lay the foundation for recent progress in
multimodal learning. These datasets are designed with the goal of training a model to do …

CVQA: Culturally-diverse multilingual visual question answering benchmark

D Romero, C Lyu, HA Wibowo, T Lynn, I Hamed… - arXiv preprint arXiv …, 2024 - arxiv.org
Visual Question Answering (VQA) is an important task in multimodal AI, and it is often used
to test the ability of vision-language models to understand and reason on knowledge …

Identifying and eliminating CSAM in generative ML training data and models

D Thiel - Stanford Internet Observatory, Cyber Policy Center …, 2023 - stacks.stanford.edu
Machine learning models that generate visual images are trained on a small number of
datasets of images. Many older models, for example, were trained on the manually labeled …