Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

LLMScore: Unveiling the power of large language models in text-to-image synthesis evaluation

Y Lu, X Yang, X Li, XE Wang… - Advances in Neural …, 2024 - proceedings.neurips.cc
Existing automatic evaluation of text-to-image synthesis can only provide an image-text
matching score, without considering object-level compositionality, which results in poor …

Multimodal procedural planning via dual text-image prompting

Y Lu, P Lu, Z Chen, W Zhu, XE Wang… - arXiv preprint - arxiv.org

World-to-words: Grounded open vocabulary acquisition through fast mapping in vision-language models

Z Ma, J Pan, J Chai - arXiv preprint arXiv:2306.08685, 2023 - arxiv.org
The ability to connect language units to their referents in the physical world, referred to as
grounding, is crucial to learning and understanding grounded meanings of words. While …