Explainable and interpretable multimodal large language models: A comprehensive survey
The rapid development of Artificial Intelligence (AI) has revolutionized numerous fields, with
large language models (LLMs) and computer vision (CV) systems driving advancements in …
NaturalBench: Evaluating vision-language models on natural adversarial samples
Vision-language models (VLMs) have made significant progress in recent visual-question-
answering (VQA) benchmarks that evaluate complex visio-linguistic reasoning. However …
SPARK: Multi-vision sensor perception and reasoning benchmark for large-scale vision-language models
Large-scale Vision-Language Models (LVLMs) have significantly advanced with text-aligned
vision inputs. They have made remarkable progress in computer vision tasks by aligning text …
Bridging the reality gap: A benchmark for physical reasoning in general world models with various physical phenomena beyond mechanics
While general world models have demonstrated excellent capability in modeling and
simulating the world through video understanding and generation, their ability to reason …
A Causality-aware Paradigm for Evaluating Creativity of Multimodal Large Language Models
Recently, numerous benchmarks have been developed to evaluate the logical reasoning
abilities of large language models (LLMs). However, assessing the equally important …
50 Shades of Deceptive Patterns: A Unified Taxonomy, Multimodal Detection, and Security Implications
Deceptive patterns (DPs) are user interface designs deliberately crafted to manipulate users
into unintended decisions, often by exploiting cognitive biases for the benefit of companies …
Can Multimodal LLMs do Visual Temporal Understanding and Reasoning? The answer is No!
Multimodal Large Language Models (MLLMs) have achieved significant advancements in
tasks like Visual Question Answering (VQA) by leveraging foundational Large Language …
An Evaluation-Driven Approach to Designing LLM Agents: Process and Architecture
The advent of Large Language Models (LLMs) has enabled the development of LLM agents
capable of autonomously achieving under-specified goals and continuously evolving …
Enhanced Multimodal RAG-LLM for Accurate Visual Question Answering
J Xue, Q Deng, F Yu, Y Wang, J Wang, Y Li - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal large language models (MLLMs), such as GPT-4o, Gemini, LLaVA, and
Flamingo, have made significant progress in integrating visual and textual modalities …
Hierarchical Vision-Language Alignment for Text-to-Image Generation via Diffusion Models
E Johnson, N Wilson - arXiv preprint arXiv:2501.00917, 2025 - arxiv.org
Text-to-image generation has witnessed significant advancements with the integration of
Large Vision-Language Models (LVLMs), yet challenges remain in aligning complex textual …