MM-LLMs: Recent advances in multimodal large language models

D Zhang, Y Yu, J Dong, C Li, D Su, C Chu… - arXiv preprint arXiv:…, 2024 - arxiv.org
In the past year, MultiModal Large Language Models (MM-LLMs) have undergone
substantial advancements, augmenting off-the-shelf LLMs to support MM inputs or outputs …

Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

Multimodal learning with transformers: A survey

P Xu, X Zhu, DA Clifton - IEEE Transactions on Pattern Analysis …, 2023 - ieeexplore.ieee.org
The Transformer is a promising neural network learner and has achieved great success in
various machine learning tasks. Thanks to the recent prevalence of multimodal applications …

Large-scale multi-modal pre-trained models: A comprehensive survey

X Wang, G Chen, G Qian, P Gao, XY Wei… - Machine Intelligence …, 2023 - Springer
With the urgent demand for generalized deep models, many pre-trained big models are
proposed, such as bidirectional encoder representations (BERT), vision transformer (ViT) …

No "zero-shot" without exponential data: Pretraining concept frequency determines multimodal model performance

V Udandarao, A Prabhu, A Ghosh… - Advances in …, 2025 - proceedings.neurips.cc
Web-crawled pretraining datasets underlie the impressive "zero-shot" evaluation
performance of multimodal models, such as CLIP for classification and Stable-Diffusion for …

Visual language pretrained multiple instance zero-shot transfer for histopathology images

MY Lu, B Chen, A Zhang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Contrastive visual language pretraining has emerged as a powerful method for either
training new language-aware image encoders or augmenting existing pretrained models …

Hallucination augmented contrastive learning for multimodal large language model

C Jiang, H Xu, M Dong, J Chen, W Ye… - Proceedings of the …, 2024 - openaccess.thecvf.com
Multi-modal large language models (MLLMs) have been shown to efficiently integrate
natural language with visual information to handle multi-modal tasks. However, MLLMs still …

Dual memory networks: A versatile adaptation approach for vision-language models

Y Zhang, W Zhu, H Tang, Z Ma… - Proceedings of the …, 2024 - openaccess.thecvf.com
With the emergence of pre-trained vision-language models like CLIP, how to adapt them to
various downstream classification tasks has garnered significant attention in recent …

VLP: A survey on vision-language pre-training

FL Chen, DZ Zhang, ML Han, XY Chen, J Shi… - Machine Intelligence …, 2023 - Springer
In the past few years, the emergence of pre-training models has brought uni-modal fields
such as computer vision (CV) and natural language processing (NLP) to a new era …

Detecting and grounding multi-modal media manipulation

R Shao, T Wu, Z Liu - … of the IEEE/CVF Conference on …, 2023 - openaccess.thecvf.com
Misinformation has become a pressing issue. Fake media, in both visual and textual forms,
is widespread on the web. While various deepfake detection and text fake news detection …