WorldGPT: Empowering LLM as Multimodal World Model
World models are progressively being employed across diverse fields, extending from basic
environment simulation to complex scenario construction. However, existing models are …
Building and better understanding vision-language models: insights and future directions
The field of vision-language models (VLMs), which take images and texts as inputs and
output texts, is rapidly evolving and has yet to reach consensus on several key aspects of …
ProVision: Programmatically Scaling Vision-centric Instruction Data for Multimodal Language Models
With the rise of multimodal applications, instruction data has become critical for training
multimodal language models capable of understanding complex image-based queries …
Can Medical Vision-Language Pre-training Succeed with Purely Synthetic Data?
Medical Vision-Language Pre-training (MedVLP) has made significant progress in enabling
zero-shot tasks for medical image understanding. However, training MedVLP models …
Vision-Language Model Dialog Games for Self-Improvement
The increasing demand for high-quality, diverse training data poses a significant bottleneck
in advancing vision-language models (VLMs). This paper presents VLM Dialog Games, a …
Image, Text, and Speech Data Augmentation using Multimodal LLMs for Deep Learning: A Survey
In the past five years, research has shifted from traditional Machine Learning (ML) and Deep
Learning (DL) approaches to leveraging Large Language Models (LLMs), including …
MM-CARP: Multimodal Model with Cross-Modal Retrieval-Augmented and Visual Region Perception
Cross-modal visual information has been demonstrated to enhance the performance of
unimodal text tasks. However, efficiently acquiring and utilizing this cross-modal visual …
Continuous or Discrete, That Is the Question: A Survey on Large Multi-Modal Models from the Perspective of Input-Output Space Extension
With the success of large language models (LLMs) driving progress towards general-
purpose AI, there has been a growing focus on extending these models to multi-modal …