Foundation Models Defining a New Era in Vision: a Survey and Outlook

M Awais, M Naseer, S Khan, RM Anwer… - … on Pattern Analysis …, 2025 - ieeexplore.ieee.org
Vision systems that see and reason about the compositional nature of visual scenes are
fundamental to understanding our world. The complex relations between objects and their …

Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

mplug-owl2: Revolutionizing multi-modal large language model with modality collaboration

Q Ye, H Xu, J Ye, M Yan, A Hu, H Liu… - Proceedings of the …, 2024 - openaccess.thecvf.com
Abstract Multi-modal Large Language Models (MLLMs) have demonstrated impressive
instruction abilities across various open-ended tasks. However previous methods have …

Scaling language-image pre-training via masking

Y Li, H Fan, R Hu… - Proceedings of the …, 2023 - openaccess.thecvf.com
Abstract We present Fast Language-Image Pre-training (FLIP), a simple and more efficient
method for training CLIP. Our method randomly masks out and removes a large portion of …

Contrastive audio-visual masked autoencoder

Y Gong, A Rouditchenko, AH Liu, D Harwath… - arxiv preprint arxiv …, 2022 - arxiv.org
In this paper, we first extend the recent Masked Auto-Encoder (MAE) model from a single
modality to audio-visual multi-modalities. Subsequently, we propose the Contrastive Audio …

Veclip: Improving clip training via visual-enriched captions

Z Lai, H Zhang, B Zhang, W Wu, H Bai… - … on Computer Vision, 2024 - Springer
Large-scale web-crawled datasets are fundamental for the success of pre-training vision-
language models, such as CLIP. However, the inherent noise and potential irrelevance of …

Vision-language models for medical report generation and visual question answering: A review

I Hartsock, G Rasool - Frontiers in Artificial Intelligence, 2024 - frontiersin.org
Medical vision-language models (VLMs) combine computer vision (CV) and natural
language processing (NLP) to analyze visual and textual medical data. Our paper reviews …

Hard patches mining for masked image modeling

H Wang, K Song, J Fan, Y Wang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Masked image modeling (MIM) has attracted much research attention due to its promising
potential for learning scalable visual representations. In typical approaches, models usually …

Context-aware alignment and mutual masking for 3d-language pre-training

Z **, M Hayat, Y Yang, Y Guo… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Abstract 3D visual language reasoning plays an important role in effective human-computer
interaction. The current approaches for 3D visual reasoning are task-specific, and lack pre …

Deep learning for cross-domain data fusion in urban computing: Taxonomy, advances, and outlook

X Zou, Y Yan, X Hao, Y Hu, H Wen, E Liu, J Zhang… - Information …, 2025 - Elsevier
As cities continue to burgeon, Urban Computing emerges as a pivotal discipline for
sustainable development by harnessing the power of cross-domain data fusion from diverse …