WorldGPT: Empowering LLM as Multimodal World Model

Z Ge, H Huang, M Zhou, J Li, G Wang, S Tang… - Proceedings of the …, 2024 - dl.acm.org
World models are progressively being employed across diverse fields, extending from basic
environment simulation to complex scenario construction. However, existing models are …

Building and better understanding vision-language models: insights and future directions

H Laurençon, A Marafioti, V Sanh… - … on Responsibly Building …, 2024 - openreview.net
The field of vision-language models (VLMs), which take images and texts as inputs and
output texts, is rapidly evolving and has yet to reach consensus on several key aspects of …

ProVision: Programmatically Scaling Vision-centric Instruction Data for Multimodal Language Models

J Zhang, L Xue, L Song, J Wang, W Huang… - arXiv preprint arXiv …, 2024 - arxiv.org
With the rise of multimodal applications, instruction data has become critical for training
multimodal language models capable of understanding complex image-based queries …

Can Medical Vision-Language Pre-training Succeed with Purely Synthetic Data?

C Liu, Z Wan, H Wang, Y Chen, T Qaiser, C Jin… - arXiv preprint arXiv …, 2024 - arxiv.org
Medical Vision-Language Pre-training (MedVLP) has made significant progress in enabling
zero-shot tasks for medical image understanding. However, training MedVLP models …

Vision-Language Model Dialog Games for Self-Improvement

K Konyushkova, C Kaplanis, S Cabi, M Denil - arXiv preprint arXiv …, 2025 - arxiv.org
The increasing demand for high-quality, diverse training data poses a significant bottleneck
in advancing vision-language models (VLMs). This paper presents VLM Dialog Games, a …

Image, Text, and Speech Data Augmentation using Multimodal LLMs for Deep Learning: A Survey

R Sapkota, S Raza, M Shoman, A Paudel… - arXiv preprint arXiv …, 2025 - arxiv.org
In the past five years, research has shifted from traditional Machine Learning (ML) and Deep
Learning (DL) approaches to leveraging Large Language Models (LLMs), including …

MM-CARP: Multimodal Model with Cross-Modal Retrieval-Augmented and Visual Region Perception

J Guo, C Fu, G Wang, R Lu, D Chen, S Tang - International Conference on …, 2024 - Springer
Cross-modal visual information has been demonstrated to enhance the performance of
unimodal text tasks. However, efficiently acquiring and utilizing this cross-modal visual …

Continuous or Discrete, That Is the Question: A Survey on Large Multi-Modal Models from the Perspective of Input-Output Space Extension

Z Li, J Zhang, D Wang, Y Wang, X Huang, Z Wei - 2024 - preprints.org
With the success of large language models (LLMs) driving progress towards general-
purpose AI, there has been a growing focus on extending these models to multi-modal …