ShareGPT4Video: Improving video understanding and generation with better captions

L Chen, X Wei, J Li, X Dong, P Zhang… - Advances in …, 2025 - proceedings.neurips.cc
Abstract We present the ShareGPT4Video series, aiming to facilitate the video
understanding of large video-language models (LVLMs) and the video generation of text-to …

Are we on the right way for evaluating large vision-language models?

L Chen, J Li, X Dong, P Zhang, Y Zang, Z Chen… - arxiv preprint arxiv …, 2024 - arxiv.org
Large vision-language models (LVLMs) have recently achieved rapid progress, sparking
numerous studies to evaluate their multi-modal capabilities. However, we dig into current …

MoVA: Adapting mixture of vision experts to multimodal context

Z Zong, B Ma, D Shen, G Song, H Shao, D Jiang… - arxiv preprint arxiv …, 2024 - arxiv.org
As the key component in multimodal large language models (MLLMs), the ability of the
visual encoder greatly affects MLLM's understanding on diverse image content. Although …

Calibrated self-rewarding vision language models

Y Zhou, Z Fan, D Cheng, S Yang, Z Chen, C Cui… - arxiv preprint arxiv …, 2024 - arxiv.org
Large Vision-Language Models (LVLMs) have made substantial progress by integrating pre-
trained large language models (LLMs) and vision models through instruction tuning. Despite …

Interleaved Scene Graph for Interleaved Text-and-Image Generation Assessment

D Chen, R Chen, S Pu, Z Liu, Y Wu, C Chen… - arxiv preprint arxiv …, 2024 - arxiv.org
Many real-world user queries (e.g., "How do to make egg fried rice?") could benefit from
systems capable of generating responses with both textual steps with accompanying …

FaceXBench: Evaluating Multimodal LLMs on Face Understanding

K Narayan, V VS, VM Patel - arxiv preprint arxiv:2501.10360, 2025 - arxiv.org
Multimodal Large Language Models (MLLMs) demonstrate impressive problem-solving
abilities across a wide range of tasks and domains. However, their capacity for face …

MMedPO: Aligning Medical Vision-Language Models with Clinical-Aware Multimodal Preference Optimization

K Zhu, P Xia, Y Li, H Zhu, S Wang, H Yao - arxiv preprint arxiv …, 2024 - arxiv.org
The advancement of Large Vision-Language Models (LVLMs) has propelled their
application in the medical field. However, Medical LVLMs (Med-LVLMs) encounter factuality …