Foundation Models Defining a New Era in Vision: a Survey and Outlook
Vision systems that see and reason about the compositional nature of visual scenes are
fundamental to understanding our world. The complex relations between objects and their …
fundamental to understanding our world. The complex relations between objects and their …
Vision-language pre-training: Basics, recent advances, and future trends
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …
intelligence that have been developed in the last few years. We group these approaches …
mplug-owl2: Revolutionizing multi-modal large language model with modality collaboration
Abstract Multi-modal Large Language Models (MLLMs) have demonstrated impressive
instruction abilities across various open-ended tasks. However previous methods have …
instruction abilities across various open-ended tasks. However previous methods have …
Scaling language-image pre-training via masking
Abstract We present Fast Language-Image Pre-training (FLIP), a simple and more efficient
method for training CLIP. Our method randomly masks out and removes a large portion of …
method for training CLIP. Our method randomly masks out and removes a large portion of …
Contrastive audio-visual masked autoencoder
In this paper, we first extend the recent Masked Auto-Encoder (MAE) model from a single
modality to audio-visual multi-modalities. Subsequently, we propose the Contrastive Audio …
modality to audio-visual multi-modalities. Subsequently, we propose the Contrastive Audio …
Veclip: Improving clip training via visual-enriched captions
Large-scale web-crawled datasets are fundamental for the success of pre-training vision-
language models, such as CLIP. However, the inherent noise and potential irrelevance of …
language models, such as CLIP. However, the inherent noise and potential irrelevance of …
Vision-language models for medical report generation and visual question answering: A review
Medical vision-language models (VLMs) combine computer vision (CV) and natural
language processing (NLP) to analyze visual and textual medical data. Our paper reviews …
language processing (NLP) to analyze visual and textual medical data. Our paper reviews …
Hard patches mining for masked image modeling
Masked image modeling (MIM) has attracted much research attention due to its promising
potential for learning scalable visual representations. In typical approaches, models usually …
potential for learning scalable visual representations. In typical approaches, models usually …
Context-aware alignment and mutual masking for 3d-language pre-training
Abstract 3D visual language reasoning plays an important role in effective human-computer
interaction. The current approaches for 3D visual reasoning are task-specific, and lack pre …
interaction. The current approaches for 3D visual reasoning are task-specific, and lack pre …
Deep learning for cross-domain data fusion in urban computing: Taxonomy, advances, and outlook
As cities continue to burgeon, Urban Computing emerges as a pivotal discipline for
sustainable development by harnessing the power of cross-domain data fusion from diverse …
sustainable development by harnessing the power of cross-domain data fusion from diverse …