MM-LLMs: Recent advances in multimodal large language models

D Zhang, Y Yu, J Dong, C Li, D Su, C Chu… - arXiv preprint arXiv …, 2024 - arxiv.org
In the past year, MultiModal Large Language Models (MM-LLMs) have undergone
substantial advancements, augmenting off-the-shelf LLMs to support MM inputs or outputs …

Vision-language pre-training: Basics, recent advances, and future trends

Z Gan, L Li, C Li, L Wang, Z Liu… - Foundations and Trends …, 2022 - nowpublishers.com
This monograph surveys vision-language pre-training (VLP) methods for multimodal
intelligence that have been developed in the last few years. We group these approaches …

Multimodal learning with transformers: A survey

P Xu, X Zhu, DA Clifton - IEEE Transactions on Pattern Analysis …, 2023 - ieeexplore.ieee.org
Transformer is a promising neural network learner, and has achieved great success in
various machine learning tasks. Thanks to the recent prevalence of multimodal applications …

Large-scale multi-modal pre-trained models: A comprehensive survey

X Wang, G Chen, G Qian, P Gao, XY Wei… - Machine Intelligence …, 2023 - Springer
With the urgent demand for generalized deep models, many pre-trained big models are
proposed, such as bidirectional encoder representations (BERT), vision transformer (ViT) …

Visual language pretrained multiple instance zero-shot transfer for histopathology images

MY Lu, B Chen, A Zhang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Contrastive visual language pretraining has emerged as a powerful method for either
training new language-aware image encoders or augmenting existing pretrained models …

A survey of vision-language pre-trained models

Y Du, Z Liu, J Li, WX Zhao - arXiv preprint arXiv:2202.10936, 2022 - arxiv.org
As transformer evolves, pre-trained models have advanced at a breakneck pace in recent
years. They have dominated the mainstream techniques in natural language processing …

VLP: A survey on vision-language pre-training

FL Chen, DZ Zhang, ML Han, XY Chen, J Shi… - Machine Intelligence …, 2023 - Springer
In the past few years, the emergence of pre-training models has brought uni-modal fields
such as computer vision (CV) and natural language processing (NLP) to a new era …

Filtering, distillation, and hard negatives for vision-language pre-training

F Radenovic, A Dubey, A Kadian… - Proceedings of the …, 2023 - openaccess.thecvf.com
Vision-language models trained with contrastive learning on large-scale noisy data are
becoming increasingly popular for zero-shot recognition problems. In this paper we improve …

Hallucination augmented contrastive learning for multimodal large language model

C Jiang, H Xu, M Dong, J Chen, W Ye… - Proceedings of the …, 2024 - openaccess.thecvf.com
Multi-modal large language models (MLLMs) have been shown to efficiently integrate
natural language with visual information to handle multi-modal tasks. However, MLLMs still …

PromptStyler: Prompt-driven style generation for source-free domain generalization

J Cho, G Nam, S Kim, H Yang… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
In a joint vision-language space, a text feature (e.g., from "a photo of a dog") could effectively
represent its relevant image features (e.g., from dog photos). Also, a recent study has …