Blink: Multimodal large language models can see but not perceive

X Fu, Y Hu, B Li, Y Feng, H Wang, X Lin, D Roth… - … on Computer Vision, 2024 - Springer
We introduce Blink, a new benchmark for multimodal language models (LLMs) that focuses
on core visual perception abilities not found in other evaluations. Most of the Blink tasks can …

Evaluating text-to-visual generation with image-to-text generation

Z Lin, D Pathak, B Li, J Li, X Xia, G Neubig… - … on Computer Vision, 2024 - Springer
Despite significant progress in generative AI, comprehensive evaluation remains
challenging because of the lack of effective metrics and standardized benchmarks. For …

Explainable and interpretable multimodal large language models: A comprehensive survey

Y Dang, K Huang, J Huo, Y Yan, S Huang, D Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid development of Artificial Intelligence (AI) has revolutionized numerous fields, with
large language models (LLMs) and computer vision (CV) systems driving advancements in …

LHRS-Bot: Empowering remote sensing with VGI-enhanced large multimodal language model

D Muhtar, Z Li, F Gu, X Zhang, P Xiao - European Conference on …, 2024 - Springer
The revolutionary capabilities of large language models (LLMs) have paved the way for
multimodal large language models (MLLMs) and fostered diverse applications across …

Kangaroo: A powerful video-language model supporting long-context video input

J Liu, Y Wang, H Ma, X Wu, X Ma, X Wei, J Jiao… - arXiv preprint arXiv …, 2024 - arxiv.org
Rapid advancements have been made in extending Large Language Models (LLMs) to
Large Multi-modal Models (LMMs). However, extending the input modality of LLMs to video data …

MME-Survey: A comprehensive survey on evaluation of multimodal LLMs

C Fu, YF Zhang, S Yin, B Li, X Fang, S Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
As a prominent direction of Artificial General Intelligence (AGI), Multimodal Large Language
Models (MLLMs) have garnered increased attention from both industry and academia …

Automated evaluation of large vision-language models on self-driving corner cases

K Chen, Y Li, W Zhang, Y Liu, P Li, R Gao… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Vision-Language Models (LVLMs) have received widespread attention for advancing
interpretable self-driving. Existing evaluations of LVLMs primarily focus on multi-faceted …

MMMU-Pro: A more robust multi-discipline multimodal understanding benchmark

X Yue, T Zheng, Y Ni, Y Wang, K Zhang, S Tong… - arXiv preprint arXiv …, 2024 - arxiv.org
This paper introduces MMMU-Pro, a robust version of the Massive Multi-discipline
Multimodal Understanding and Reasoning (MMMU) benchmark. MMMU-Pro rigorously …

A survey on evaluation of multimodal large language models

J Huang, J Zhang - arXiv preprint arXiv:2408.15769, 2024 - arxiv.org
Multimodal Large Language Models (MLLMs) mimic the human perception and reasoning
system by integrating powerful Large Language Models (LLMs) with various modality …

VHELM: A holistic evaluation of vision language models

T Lee, H Tu, CH Wong, W Zheng… - arXiv preprint arXiv …, 2024 - proceedings.neurips.cc
Current benchmarks for assessing vision-language models (VLMs) often focus on their
perception or problem-solving capabilities and neglect other critical aspects such as …