- Academic Search

Explainable and interpretable multimodal large language models: A comprehensive survey

Y Dang, K Huang, J Huo, Y Yan, S Huang, D Liu… - arXiv preprint arXiv …, 2024 - arxiv.org

The rapid development of Artificial Intelligence (AI) has revolutionized numerous fields, with
large language models (LLMs) and computer vision (CV) systems driving advancements in …

Cited by 5

Naturalbench: Evaluating vision-language models on natural adversarial samples

B Li, Z Lin, W Peng, JD Nyandwi, D Jiang, Z Ma… - arXiv preprint arXiv …, 2024 - arxiv.org

Vision-language models (VLMs) have made significant progress in recent visual-question-
answering (VQA) benchmarks that evaluate complex visio-linguistic reasoning. However …

Cited by 6

Spark: Multi-vision sensor perception and reasoning benchmark for large-scale vision-language models

Y Yu, S Chung, BK Lee, YM Ro - arXiv preprint arXiv:2408.12114, 2024 - arxiv.org

Large-scale Vision-Language Models (LVLMs) have significantly advanced with text-aligned
vision inputs. They have made remarkable progress in computer vision tasks by aligning text …

Cited by 2

Bridging the reality gap: A benchmark for physical reasoning in general world models with various physical phenomena beyond mechanics

P Zhao, J Xu, N Cheng, H Hu, X Zhang, X Xu… - Expert Systems with …, 2025 - Elsevier

While general world models have demonstrated excellent capability in modeling and
simulating the world through video understanding and generation, their ability to reason …

A Causality-aware Paradigm for Evaluating Creativity of Multimodal Large Language Models

Z Huang, S Zhong, P Zhou, S Gao, M Zitnik… - arXiv preprint arXiv …, 2025 - arxiv.org

Recently, numerous benchmarks have been developed to evaluate the logical reasoning
abilities of large language models (LLMs). However, assessing the equally important …

50 Shades of Deceptive Patterns: A Unified Taxonomy, Multimodal Detection, and Security Implications

Z Shi, R Sun, J Chen, J Sun, M Xue, Y Gao… - arXiv preprint arXiv …, 2025 - arxiv.org

Deceptive patterns (DPs) are user interface designs deliberately crafted to manipulate users
into unintended decisions, often by exploiting cognitive biases for the benefit of companies …

Can Multimodal LLMs do Visual Temporal Understanding and Reasoning? The answer is No!

MF Imam, C Lyu, AF Aji - arXiv preprint arXiv:2501.10674, 2025 - arxiv.org

Multimodal Large Language Models (MLLMs) have achieved significant advancements in
tasks like Visual Question Answering (VQA) by leveraging foundational Large Language …

An Evaluation-Driven Approach to Designing LLM Agents: Process and Architecture

B Xia, Q Lu, L Zhu, Z Xing, D Zhao, H Zhang - arXiv preprint arXiv …, 2024 - arxiv.org

The advent of Large Language Models (LLMs) has enabled the development of LLM agents
capable of autonomously achieving under-specified goals and continuously evolving …

Enhanced Multimodal RAG-LLM for Accurate Visual Question Answering

J Xue, Q Deng, F Yu, Y Wang, J Wang, Y Li - arXiv preprint arXiv …, 2024 - arxiv.org

Multimodal large language models (MLLMs), such as GPT-4o, Gemini, LLaVA, and
Flamingo, have made significant progress in integrating visual and textual modalities …

Hierarchical Vision-Language Alignment for Text-to-Image Generation via Diffusion Models

E Johnson, N Wilson - arXiv preprint arXiv:2501.00917, 2025 - arxiv.org

Text-to-image generation has witnessed significant advancements with the integration of
Large Vision-Language Models (LVLMs), yet challenges remain in aligning complex textual …


A survey on benchmarks of multimodal large language models

Explainable and interpretable multimodal large language models: A comprehensive survey
