Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

Large-scale text-to-image generation models for visual artists' creative works

HK Ko, G Park, H Jeon, J Jo, J Kim, J Seo - Proceedings of the 28th …, 2023 - dl.acm.org
Large-scale Text-to-image Generation Models (LTGMs) (e.g., DALL-E), self-supervised deep
learning models trained on a huge dataset, have demonstrated the capacity for generating …

Videocrafter2: Overcoming data limitations for high-quality video diffusion models

H Chen, Y Zhang, X Cun, M …

… counterfactuals for photorealistic object removal and insertion

D Winter, M Cohen, S Fruchter, Y Pritch… - … on Computer Vision, 2024 - Springer
Diffusion models have revolutionized image editing but often generate images that violate
physical laws, particularly the effects of objects on the scene, e.g., occlusions, shadows, and …

Generative disco: Text-to-video generation for music visualization

V Liu, T Long, N Raw, L Chilton - arXiv preprint arXiv:2304.08551, 2023 - arxiv.org
Visuals can enhance our experience of music, owing to the way they can amplify the
emotions and messages conveyed within it. However, creating music visualization is a …