Explainable and interpretable multimodal large language models: A comprehensive survey

Y Dang, K Huang, J Huo, Y Yan, S Huang, D Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid development of Artificial Intelligence (AI) has revolutionized numerous fields, with
large language models (LLMs) and computer vision (CV) systems driving advancements in …

NaturalBench: Evaluating vision-language models on natural adversarial samples

B Li, Z Lin, W Peng, JD Nyandwi, D Jiang, Z Ma… - arXiv preprint arXiv …, 2024 - arxiv.org
Vision-language models (VLMs) have made significant progress in recent visual-question-
answering (VQA) benchmarks that evaluate complex visio-linguistic reasoning. However …

SPARK: Multi-vision sensor perception and reasoning benchmark for large-scale vision-language models

Y Yu, S Chung, BK Lee, YM Ro - arXiv preprint arXiv:2408.12114, 2024 - arxiv.org
Large-scale Vision-Language Models (LVLMs) have significantly advanced with text-aligned
vision inputs. They have made remarkable progress in computer vision tasks by aligning text …

Bridging the reality gap: A benchmark for physical reasoning in general world models with various physical phenomena beyond mechanics

P Zhao, J Xu, N Cheng, H Hu, X Zhang, X Xu… - Expert Systems with …, 2025 - Elsevier
While general world models have demonstrated excellent capability in modeling and
simulating the world through video understanding and generation, their ability to reason …

A Causality-aware Paradigm for Evaluating Creativity of Multimodal Large Language Models

Z Huang, S Zhong, P Zhou, S Gao, M Zitnik… - arXiv preprint arXiv …, 2025 - arxiv.org
Recently, numerous benchmarks have been developed to evaluate the logical reasoning
abilities of large language models (LLMs). However, assessing the equally important …

50 Shades of Deceptive Patterns: A Unified Taxonomy, Multimodal Detection, and Security Implications

Z Shi, R Sun, J Chen, J Sun, M Xue, Y Gao… - arXiv preprint arXiv …, 2025 - arxiv.org
Deceptive patterns (DPs) are user interface designs deliberately crafted to manipulate users
into unintended decisions, often by exploiting cognitive biases for the benefit of companies …

Can Multimodal LLMs do Visual Temporal Understanding and Reasoning? The answer is No!

MF Imam, C Lyu, AF Aji - arXiv preprint arXiv:2501.10674, 2025 - arxiv.org
Multimodal Large Language Models (MLLMs) have achieved significant advancements in
tasks like Visual Question Answering (VQA) by leveraging foundational Large Language …

An Evaluation-Driven Approach to Designing LLM Agents: Process and Architecture

B Xia, Q Lu, L Zhu, Z Xing, D Zhao, H Zhang - arXiv preprint arXiv …, 2024 - arxiv.org
The advent of Large Language Models (LLMs) has enabled the development of LLM agents
capable of autonomously achieving under-specified goals and continuously evolving …

Enhanced Multimodal RAG-LLM for Accurate Visual Question Answering

J Xue, Q Deng, F Yu, Y Wang, J Wang, Y Li - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal large language models (MLLMs), such as GPT-4o, Gemini, LLaVA, and
Flamingo, have made significant progress in integrating visual and textual modalities …

Hierarchical Vision-Language Alignment for Text-to-Image Generation via Diffusion Models

E Johnson, N Wilson - arXiv preprint arXiv:2501.00917, 2025 - arxiv.org
Text-to-image generation has witnessed significant advancements with the integration of
Large Vision-Language Models (LVLMs), yet challenges remain in aligning complex textual …