Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

Deep learning approaches on image captioning: A review

T Ghandi, H Pourreza, H Mahyar - ACM Computing Surveys, 2023 - dl.acm.org
Image captioning is a research area of immense importance, aiming to generate natural
language descriptions for visual content in the form of still images. The advent of deep …

Multimodal foundation models: From specialists to general-purpose assistants

C Li, Z Gan, Z Yang, J Yang, L Li… - … and Trends® in …, 2024 - nowpublishers.com
This monograph presents a comprehensive survey of the taxonomy and evolution of multimodal
foundation models that demonstrate vision and vision-language capabilities, focusing on the …

Git: A generative image-to-text transformer for vision and language

J Wang, Z Yang, X Hu, L Li, K Lin, Z Gan, Z Liu… - arXiv preprint arXiv …, 2022 - arxiv.org
In this paper, we design and train a Generative Image-to-text Transformer, GIT, to unify
vision-language tasks such as image/video captioning and question answering. While …

Coarse-to-fine vision-language pre-training with fusion in the backbone

ZY Dou, A Kamath, Z Gan, P Zhang… - Advances in neural …, 2022 - proceedings.neurips.cc
Abstract Vision-language (VL) pre-training has recently received considerable attention.
However, most existing end-to-end pre-training approaches either only aim to tackle VL …

Semantic-conditional diffusion networks for image captioning

J Luo, Y Li, Y Pan, T Yao, J Feng… - Proceedings of the …, 2023 - openaccess.thecvf.com
Recent advances in text-to-image generation have witnessed the rise of diffusion models,
which act as powerful generative models. Nevertheless, it is not trivial to exploit such latent …

Meacap: Memory-augmented zero-shot image captioning

Z Zeng, Y Xie, H Zhang, C Chen… - Proceedings of the …, 2024 - openaccess.thecvf.com
Zero-shot image captioning (IC) without well-paired image-text data can be categorized into
two main types: training-free and text-only-training methods. While both types integrate pre …

Caption anything: Interactive image description with diverse multimodal controls

T Wang, J Zhang, J Fei, H Zheng, Y Tang, Z Li… - arXiv preprint arXiv …, 2023 - arxiv.org
Controllable image captioning is an emerging multimodal topic that aims to describe the
image with natural language following human purpose, e.g., looking at the …

Tag2text: Guiding vision-language model via image tagging

X Huang, Y Zhang, J Ma, W Tian, R Feng… - arXiv preprint arXiv …, 2023 - arxiv.org
This paper presents Tag2Text, a vision language pre-training (VLP) framework, which
introduces image tagging into vision-language models to guide the learning of visual …

Conzic: Controllable zero-shot image captioning by sampling-based polishing

Z Zeng, H Zhang, R Lu, D Wang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Zero-shot capability has been considered as a new revolution of deep learning, letting
machines work on tasks without curated training data. As a good start and the only existing …