Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

Z Chen, J Wu, W Wang, W Su, G Chen… - Proceedings of the …, 2024 - openaccess.thecvf.com
The exponential growth of large language models (LLMs) has opened up numerous
possibilities for multi-modal AGI systems. However, the progress in vision and vision …

LAION-5B: An open large-scale dataset for training next generation image-text models

C Schuhmann, R Beaumont, R Vencu… - Advances in neural …, 2022 - proceedings.neurips.cc
Groundbreaking language-vision architectures like CLIP and DALL-E proved the utility of
training on large amounts of noisy image-text data, without relying on expensive accurate …

Chinese CLIP: Contrastive vision-language pretraining in Chinese

A Yang, J Pan, J Lin, R Men, Y Zhang, J Zhou… - arXiv preprint arXiv …, 2022 - arxiv.org
The tremendous success of CLIP (Radford et al., 2021) has promoted the research and
application of contrastive learning for vision-language pretraining. In this work, we construct …

AltCLIP: Altering the language encoder in CLIP for extended language capabilities

Z Chen, G Liu, BW Zhang, F Ye, Q Yang… - arXiv preprint arXiv …, 2022 - arxiv.org
In this work, we present a conceptually simple and effective method to train a strong
bilingual/multilingual multimodal representation model. Starting from the pre-trained …

Large multilingual models pivot zero-shot multimodal learning across languages

J Hu, Y Yao, C Wang, S Wang, Y Pan, Q Chen… - arXiv preprint arXiv …, 2023 - arxiv.org
Recently, there has been a significant surge in multimodal learning in terms of both image-to-
text and text-to-image generation. However, the success is typically limited to English …

On the power of foundation models

Y Yuan - International Conference on Machine Learning, 2023 - proceedings.mlr.press
With infinitely many high-quality data points, infinite computational power, an infinitely large
foundation model with a perfect training algorithm and guaranteed zero generalization error …

Multilingual diversity improves vision-language representations

T Nguyen, M Wallingford, S Santy… - Advances in …, 2025 - proceedings.neurips.cc
Massive web-crawled image-text datasets lay the foundation for recent progress in
multimodal learning. These datasets are designed with the goal of training a model to do …

CVQA: Culturally-diverse multilingual visual question answering benchmark

D Romero, C Lyu, HA Wibowo, T Lynn, I Hamed… - arXiv preprint arXiv …, 2024 - arxiv.org
Visual Question Answering (VQA) is an important task in multimodal AI, and it is often used
to test the ability of vision-language models to understand and reason on knowledge …

Identifying and eliminating CSAM in generative ML training data and models

D Thiel - Stanford Internet Observatory, Cyber Policy Center …, 2023 - stacks.stanford.edu
Machine learning models that generate visual images are trained on a small number of
datasets of images. Many older models, for example, were trained on the manually labeled …