Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

A comprehensive survey of deep learning for image captioning

MDZ Hossain, F Sohel, MF Shiratuddin… - ACM Computing Surveys …, 2019 - dl.acm.org
Generating a description of an image is called image captioning. Image captioning requires
recognizing the important objects, their attributes, and their relationships in an image. It also …

ImageBind: One embedding space to bind them all

R Girdhar, A El-Nouby, Z Liu, M Singh… - Proceedings of the …, 2023 - openaccess.thecvf.com
We present ImageBind, an approach to learn a joint embedding across six different
modalities-images, text, audio, depth, thermal, and IMU data. We show that all combinations …

Large-scale domain-specific pretraining for biomedical vision-language processing

S Zhang, Y Xu, N Usuyama, J Bagga… - arXiv preprint arXiv …, 2023 - researchgate.net
Contrastive pretraining on parallel image-text data has attained great success in vision-
language processing (VLP), as exemplified by CLIP and related methods. However, prior …
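
The contrastive image-text pretraining this entry attributes to CLIP and related methods can be sketched minimally: embed each modality, normalize, and train both directions of a symmetric InfoNCE loss so matched pairs score higher than all in-batch mismatches. The function name, shapes, and temperature below are illustrative assumptions, not any paper's implementation.

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    img_emb, txt_emb: (batch, dim) arrays; row i of each is a matched pair.
    """
    # L2-normalize so dot products are cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature   # (batch, batch) similarity matrix
    labels = np.arange(len(logits))      # matched pairs lie on the diagonal

    def xent(l):
        # Row-wise softmax cross-entropy against the diagonal targets.
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (xent(logits) + xent(logits.T))
```

At scale the same objective is computed with learned image and text encoders and large batches; the loss itself is unchanged.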

Scaling up visual and vision-language representation learning with noisy text supervision

C Jia, Y Yang, Y Xia, YT Chen… - International …, 2021 - proceedings.mlr.press
Pre-trained representations are becoming crucial for many NLP and perception tasks. While
representation learning in NLP has transitioned to training on raw text without human …

CLIP-Forge: Towards zero-shot text-to-shape generation

A Sanghi, H Chu, JG Lambourne… - Proceedings of the …, 2022 - openaccess.thecvf.com
Generating shapes using natural language can enable new ways of imagining and creating
the things around us. While significant recent progress has been made in text-to-image …

Multi-modality cross attention network for image and sentence matching

X Wei, T Zhang, Y Li, Y Zhang… - Proceedings of the IEEE …, 2020 - openaccess.thecvf.com
The key to image and sentence matching is to accurately measure the visual-semantic
similarity between an image and a sentence. However, most existing methods make use of …

Combined scaling for zero-shot transfer learning

H Pham, Z Dai, G Ghiasi, K Kawaguchi, H Liu, AW Yu… - Neurocomputing, 2023 - Elsevier
Recent developments in multimodal training methodologies, including CLIP and ALIGN,
obviate the necessity for individual data labeling. These approaches utilize pairs of data and …

Deep multimodal representation learning: A survey

W Guo, J Wang, S Wang - IEEE Access, 2019 - ieeexplore.ieee.org
Multimodal representation learning, which aims to narrow the heterogeneity gap among
different modalities, plays an indispensable role in the utilization of ubiquitous multimodal …

Stacked cross attention for image-text matching

KH Lee, X Chen, G Hua, H Hu… - Proceedings of the …, 2018 - openaccess.thecvf.com
In this paper, we study the problem of image-text matching. Inferring the latent semantic
alignment between objects or other salient stuff (e.g., snow, sky, lawn) and the corresponding …
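
The cross-attention matching idea in this entry can be sketched as follows: each word attends over all image regions, and the sentence-image score aggregates the similarity between words and their attended visual contexts. This is a minimal NumPy illustration of that pattern; the function name, the `smooth` factor, and the mean aggregation are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_similarity(regions, words, smooth=4.0):
    """Image-sentence similarity via word-to-region cross attention.

    regions: (R, d) image region features; words: (W, d) word features.
    Each word attends over all regions; the score is the mean cosine
    similarity between words and their attended region contexts.
    """
    r = regions / np.linalg.norm(regions, axis=1, keepdims=True)
    w = words / np.linalg.norm(words, axis=1, keepdims=True)
    attn = softmax(smooth * (w @ r.T), axis=1)  # (W, R): each word's focus over regions
    context = attn @ r                          # (W, d) attended visual context per word
    context = context / np.linalg.norm(context, axis=1, keepdims=True)
    return float(np.mean(np.sum(w * context, axis=1)))
```

In practice the region and word features come from learned encoders (e.g. a detector and a text model), and the attention may be stacked in both directions; the scoring skeleton stays the same.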