Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

Vlp: A survey on vision-language pre-training

FL Chen, DZ Zhang, ML Han, XY Chen, J Shi… - Machine Intelligence …, 2023 - Springer
In the past few years, the emergence of pre-training models has brought uni-modal fields
such as computer vision (CV) and natural language processing (NLP) to a new era …

Datacomp: In search of the next generation of multimodal datasets

SY Gadre, G Ilharco, A Fang… - Advances in …, 2023 - proceedings.neurips.cc
Multimodal datasets are a critical component in recent breakthroughs such as CLIP, Stable
Diffusion and GPT-4, yet their design does not receive the same research attention as model …

Lm-nav: Robotic navigation with large pre-trained models of language, vision, and action

D Shah, B Osiński, S Levine - Conference on robot …, 2023 - proceedings.mlr.press
Goal-conditioned policies for robotic navigation can be trained on large, unannotated
datasets, providing for good generalization to real-world settings. However, particularly in …

Socratic models: Composing zero-shot multimodal reasoning with language

A Zeng, M Attarian, B Ichter, K Choromanski… - arXiv preprint, 2022 - arxiv.org

Pic2word: Mapping pictures to words for zero-shot composed image retrieval

K Saito, K Sohn, X Zhang, CL Li… - Proceedings of the …, 2023 - openaccess.thecvf.com
Abstract In Composed Image Retrieval (CIR), a user combines a query image with text to
describe their intended target. Existing methods rely on supervised learning of CIR models …

Multimodality helps unimodality: Cross-modal few-shot learning with multimodal models

Z Lin, S Yu, Z Kuang, D Pathak… - Proceedings of the …, 2023 - openaccess.thecvf.com
The ability to quickly learn a new task with minimal instruction (known as few-shot learning) is
a central aspect of intelligent agents. Classical few-shot benchmarks make use of few-shot …

ChangeCLIP: Remote sensing change detection with multimodal vision-language representation learning

S Dong, L Wang, B Du, X Meng - ISPRS Journal of Photogrammetry and …, 2024 - Elsevier
Remote sensing change detection (RSCD), which aims to identify surface changes from
bitemporal images, is significant for many applications, such as environmental protection …

Weakly supervised 3d open-vocabulary segmentation

K Liu, F Zhan, J Zhang, M Xu, Y Yu… - Advances in …, 2023 - proceedings.neurips.cc
Open-vocabulary segmentation of 3D scenes is a fundamental function of human perception
and thus a crucial objective in computer vision research. However, this task is heavily …

Text-only training for image captioning using noise-injected clip

D Nukrai, R Mokady, A Globerson - arXiv preprint arXiv:2211.00575, 2022 - arxiv.org
We consider the task of image-captioning using only the CLIP model and additional text data
at training time, and no additional captioned images. Our approach relies on the fact that …