Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

Vlp: A survey on vision-language pre-training

FL Chen, DZ Zhang, ML Han, XY Chen, J Shi… - Machine Intelligence …, 2023 - Springer
In the past few years, the emergence of pre-training models has brought uni-modal fields
such as computer vision (CV) and natural language processing (NLP) to a new era …

Datacomp: In search of the next generation of multimodal datasets

SY Gadre, G Ilharco, A Fang… - Advances in …, 2023 - proceedings.neurips.cc
Multimodal datasets are a critical component in recent breakthroughs such as CLIP, Stable
Diffusion and GPT-4, yet their design does not receive the same research attention as model …

Lm-nav: Robotic navigation with large pre-trained models of language, vision, and action

D Shah, B Osiński, S Levine - Conference on robot …, 2023 - proceedings.mlr.press
Goal-conditioned policies for robotic navigation can be trained on large, unannotated
datasets, providing for good generalization to real-world settings. However, particularly in …

Socratic models: Composing zero-shot multimodal reasoning with language

A Zeng, M Attarian, B Ichter, K Choromanski… - arXiv preprint, 2022 - arxiv.org

Pic2word: Mapping pictures to words for zero-shot composed image retrieval

K Saito, K Sohn, X Zhang, CL Li… - Proceedings of the …, 2023 - openaccess.thecvf.com
Abstract In Composed Image Retrieval (CIR), a user combines a query image with text to
describe their intended target. Existing methods rely on supervised learning of CIR models …

Multimodality helps unimodality: Cross-modal few-shot learning with multimodal models

Z Lin, S Yu, Z Kuang, D Pathak… - Proceedings of the …, 2023 - openaccess.thecvf.com
The ability to quickly learn a new task with minimal instruction (known as few-shot learning) is
a central aspect of intelligent agents. Classical few-shot benchmarks make use of few-shot …

ChangeCLIP: Remote sensing change detection with multimodal vision-language representation learning

S Dong, L Wang, B Du, X Meng - ISPRS Journal of Photogrammetry and …, 2024 - Elsevier
Remote sensing change detection (RSCD), which aims to identify surface changes from
bitemporal images, is significant for many applications, such as environmental protection …

Weakly supervised 3d open-vocabulary segmentation

K Liu, F Zhan, J Zhang, M Xu, Y Yu… - Advances in …, 2023 - proceedings.neurips.cc
Open-vocabulary segmentation of 3D scenes is a fundamental function of human perception
and thus a crucial objective in computer vision research. However, this task is heavily …

Text-only training for image captioning using noise-injected clip

D Nukrai, R Mokady, A Globerson - arXiv preprint arXiv:2211.00575, 2022 - arxiv.org
We consider the task of image-captioning using only the CLIP model and additional text data
at training time, and no additional captioned images. Our approach relies on the fact that …