Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

Image-text retrieval: A survey on recent research and development

M Cao, S Li, J Li, L Nie, M Zhang - arXiv preprint arXiv:2203.14713, 2022 - arxiv.org
In the past few years, cross-modal image-text retrieval (ITR) has experienced increased
interest in the research community due to its excellent research value and broad real-world …

Unified contrastive learning in image-text-label space

J Yang, C Li, P Zhang, B Xiao, C Liu… - Proceedings of the …, 2022 - openaccess.thecvf.com
Visual recognition is recently learned via either supervised learning on human-annotated
image-label data or language-image contrastive learning with webly-crawled image-text …

Towards language-free training for text-to-image generation

Y Zhou, R Zhang, C Chen, C Li… - Proceedings of the …, 2022 - openaccess.thecvf.com
One of the major challenges in training text-to-image generation models is the need for a
large number of high-quality text-image pairs. While image samples are often easily …

TransVG: End-to-end visual grounding with transformers

J Deng, Z Yang, T Chen, W Zhou… - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com
In this paper, we present a neat yet effective transformer-based framework for visual
grounding, namely TransVG, to address the task of grounding a language query to the …

SeqTR: A simple yet universal network for visual grounding

C Zhu, Y Zhou, Y Shen, G Luo, X Pan, M Lin… - … on Computer Vision, 2022 - Springer
In this paper, we propose a simple yet universal network termed SeqTR for visual grounding
tasks, e.g., phrase localization, referring expression comprehension (REC) and segmentation …

Improving visual grounding with visual-linguistic verification and iterative reasoning

L Yang, Y Xu, C Yuan, W Liu, B Li… - Proceedings of the …, 2022 - openaccess.thecvf.com
Visual grounding is a task to locate the target indicated by a natural language expression.
Existing methods extend the generic object detection framework to this problem. They base …

HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips

A Miech, D Zhukov, JB Alayrac… - Proceedings of the …, 2019 - openaccess.thecvf.com
Learning text-video embeddings usually requires a dataset of video clips with manually
provided captions. However, such datasets are expensive and time-consuming to create and …

Self-supervised multimodal versatile networks

JB Alayrac, A Recasens, R Schneider… - Advances in neural …, 2020 - proceedings.neurips.cc
Videos are a rich source of multi-modal supervision. In this work, we learn representations
using self-supervision by leveraging three modalities naturally present in videos: visual …

Multi-modality cross attention network for image and sentence matching

X Wei, T Zhang, Y Li, Y Zhang… - Proceedings of the IEEE …, 2020 - openaccess.thecvf.com
The key to image and sentence matching is to accurately measure the visual-semantic
similarity between an image and a sentence. However, most existing methods make use of …