A survey on hallucination in large vision-language models

H Liu, W Xue, Y Chen, D Chen, X Zhao, K Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent development of Large Vision-Language Models (LVLMs) has attracted growing
attention within the AI landscape for its practical implementation potential. However, …

From large language models to large multimodal models: A literature review

D Huang, C Yan, Q Li, X Peng - Applied Sciences, 2024 - mdpi.com
As research on Large Language Models (LLMs) has deepened, significant progress has
been made in recent years in developing Large Multimodal Models (LMMs), which …

MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI

X Yue, Y Ni, K Zhang, T Zheng, R Liu… - Proceedings of the …, 2024 - openaccess.thecvf.com
We introduce MMMU: a new benchmark designed to evaluate multimodal models on
massive multi-discipline tasks demanding college-level subject knowledge and deliberate …
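
To make the kind of evaluation such a benchmark implies concrete, below is a minimal sketch of multiple-choice scoring with per-subject accuracy. The record shape and the model_answer stub are hypothetical placeholders, not MMMU's actual data schema or evaluation API.

```python
# A minimal sketch of multiple-choice benchmark scoring of the kind MMMU implies.
# The record shape and the model_answer stub are hypothetical placeholders,
# not MMMU's actual data schema or evaluation API.
from collections import defaultdict

examples = [  # hypothetical records in an MMMU-like shape
    {"subject": "Physics", "question": "...", "answer": "B"},
    {"subject": "Art", "question": "...", "answer": "D"},
]

def model_answer(example):
    """Placeholder for a real LVLM call that returns an option letter."""
    return "B"

per_subject = defaultdict(lambda: [0, 0])  # subject -> [correct, total]
for ex in examples:
    per_subject[ex["subject"]][0] += int(model_answer(ex) == ex["answer"])
    per_subject[ex["subject"]][1] += 1

for subject, (correct, total) in sorted(per_subject.items()):
    print(f"{subject}: {correct}/{total} = {correct / total:.0%}")
```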

How far are we to GPT-4V? Closing the gap to commercial multimodal models with open-source suites

Z Chen, W Wang, H Tian, S Ye, Z Gao, E Cui… - Science China …, 2024 - Springer
In this paper, we introduce InternVL 1.5, an open-source multimodal large language model
(MLLM) to bridge the capability gap between open-source and proprietary commercial …

Visual autoregressive modeling: Scalable image generation via next-scale prediction

K Tian, Y Jiang, Z Yuan, B Peng… - Advances in neural …, 2025 - proceedings.neurips.cc
We present Visual AutoRegressive modeling (VAR), a new generation paradigm
that redefines the autoregressive learning on images as coarse-to-fine "next-scale …
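
As a rough illustration of the coarse-to-fine idea named in this abstract, here is a minimal, runnable sketch in which each token map is predicted conditioned on the upsampled embedding of the previous, coarser map. The scales, codebook size, and the single linear head standing in for a transformer are all toy assumptions, not VAR's actual architecture.

```python
# A minimal sketch of coarse-to-fine "next-scale prediction": each token map is
# generated conditioned on the upsampled embedding of the previous, coarser map.
# Scales, codebook size, and the linear head standing in for a transformer are
# toy assumptions, not VAR's actual architecture.
import torch
import torch.nn.functional as F

SCALES = [1, 2, 4, 8]  # token-map side lengths, coarse to fine (assumed)
VOCAB, DIM = 256, 64   # toy codebook size and embedding width (assumed)

embed = torch.nn.Embedding(VOCAB, DIM)
head = torch.nn.Linear(DIM, VOCAB)  # stand-in for the autoregressive model

def generate():
    prev = torch.zeros(1, DIM, 1, 1)  # assumed start condition
    maps = []
    for s in SCALES:
        # condition on the coarser map, upsampled to the current resolution
        cond = F.interpolate(prev, size=(s, s), mode="nearest")
        logits = head(cond.permute(0, 2, 3, 1))   # (1, s, s, VOCAB)
        tokens = logits.argmax(-1)                # greedy decoding, for brevity
        maps.append(tokens)
        prev = embed(tokens).permute(0, 3, 1, 2)  # (1, DIM, s, s)
    return maps

for m in generate():
    print(tuple(m.shape))  # (1, 1, 1), (1, 2, 2), (1, 4, 4), (1, 8, 8)
```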

ShareGPT4Video: Improving video understanding and generation with better captions

L Chen, X Wei, J Li, X Dong, P Zhang… - Advances in …, 2025 - proceedings.neurips.cc
We present the ShareGPT4Video series, aiming to facilitate the video
understanding of large video-language models (LVLMs) and the video generation of text-to …

Internvideo2: Scaling foundation models for multimodal video understanding

Y Wang, K Li, X Li, J Yu, Y He, G Chen, B Pei… - … on Computer Vision, 2024 - Springer
We introduce InternVideo2, a new family of video foundation models (ViFM) that
achieve state-of-the-art results in video recognition, video-text tasks, and video-centric dialogue. Our …

Are we on the right way for evaluating large vision-language models?

L Chen, J Li, X Dong, P Zhang, Y Zang, Z Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
Large vision-language models (LVLMs) have recently achieved rapid progress, sparking
numerous studies to evaluate their multi-modal capabilities. However, we dig into current …

MoE-LLaVA: Mixture of experts for large vision-language models

B Lin, Z Tang, Y Ye, J Cui, B Zhu, P Jin, J Huang… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent advances demonstrate that scaling Large Vision-Language Models (LVLMs)
effectively improves downstream task performances. However, existing scaling methods …
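
For a concrete picture of the general technique this entry applies to LVLMs, here is a minimal sketch of sparse top-k mixture-of-experts routing: a router scores each token, and only the k highest-scoring expert networks process it. This is a generic MoE feed-forward layer, not the paper's exact design; all sizes are assumed.

```python
# A minimal sketch of sparse top-k mixture-of-experts routing. A generic MoE
# feed-forward layer, not MoE-LLaVA's exact design; all sizes are assumed.
import torch
import torch.nn.functional as F

class TopKMoE(torch.nn.Module):
    def __init__(self, dim=64, num_experts=4, k=2):
        super().__init__()
        self.k = k
        self.gate = torch.nn.Linear(dim, num_experts)  # router
        self.experts = torch.nn.ModuleList([
            torch.nn.Sequential(
                torch.nn.Linear(dim, 4 * dim),
                torch.nn.GELU(),
                torch.nn.Linear(4 * dim, dim),
            )
            for _ in range(num_experts)
        ])

    def forward(self, x):                      # x: (tokens, dim)
        weights, idx = self.gate(x).topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # renormalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):             # dispatch tokens to top-k experts
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

print(TopKMoE()(torch.randn(8, 64)).shape)  # torch.Size([8, 64])
```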

PaliGemma: A versatile 3B VLM for transfer

L Beyer, A Steiner, AS Pinto, A Kolesnikov… - arXiv preprint arXiv …, 2024 - arxiv.org
PaliGemma is an open Vision-Language Model (VLM) that is based on the SigLIP-So400m
vision encoder and the Gemma-2B language model. It is trained to be a versatile and …
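
The entry names a common VLM composition: a vision encoder plus a language model. Below is a minimal sketch of that general pattern, in which patch features are projected into the language model's embedding space and prepended as a prefix; the tiny modules are stand-ins, not SigLIP or Gemma, and every size is assumed.

```python
# A minimal sketch of the vision-encoder-plus-LM pattern this entry describes:
# patch features are projected into the LM's embedding space and prepended as
# a prefix. Tiny stand-in modules, not SigLIP or Gemma; all sizes are assumed.
import torch

VISION_DIM, LM_DIM, VOCAB = 32, 64, 1000  # toy widths (assumed)

vision_encoder = torch.nn.Linear(VISION_DIM, VISION_DIM)  # stand-in encoder
projector = torch.nn.Linear(VISION_DIM, LM_DIM)           # vision-to-LM bridge
token_embed = torch.nn.Embedding(VOCAB, LM_DIM)           # stand-in embeddings
lm_body = torch.nn.TransformerEncoderLayer(LM_DIM, nhead=4, batch_first=True)

patches = torch.randn(1, 16, VISION_DIM)    # 16 image patches (assumed)
text_ids = torch.randint(0, VOCAB, (1, 8))  # 8 text tokens

image_tokens = projector(vision_encoder(patches))  # (1, 16, LM_DIM)
text_tokens = token_embed(text_ids)                # (1, 8, LM_DIM)

# the LM attends over [image prefix ; text tokens] as one sequence
sequence = torch.cat([image_tokens, text_tokens], dim=1)
print(lm_body(sequence).shape)  # torch.Size([1, 24, 64])
```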