Prism: A Framework for Decoupling and Assessing the Capabilities of VLMs

Y Qiao, H Duan, X Fang, J Yang… - Advances in …, 2025 - proceedings.neurips.cc
Abstract Vision Language Models (VLMs) demonstrate remarkable proficiency in addressing
a wide array of visual questions, which requires strong perception and reasoning faculties …

A peek into token bias: Large language models are not yet genuine reasoners

B Jiang, Y Xie, Z Hao, X Wang, T Mallick, WJ Su… - arXiv preprint arXiv …, 2024 - arxiv.org
This study introduces a hypothesis-testing framework to assess whether large language
models (LLMs) possess genuine reasoning abilities or primarily depend on token bias. We …

Evaluating ChatGPT-4 Vision on Brazil's National Undergraduate Computer Science Exam

NC Mendonça - ACM Transactions on Computing Education, 2024 - dl.acm.org
The recent integration of visual capabilities into Large Language Models (LLMs) has the
potential to play a pivotal role in science and technology education, where visual elements …

What Is the Visual Cognition Gap Between Humans and Multimodal LLMs?

X Cao, B Lai, W Ye, Y Ma, J Heintz, J Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
Recently, Multimodal Large Language Models (MLLMs) have shown great promise in
language-guided perceptual tasks such as recognition, segmentation, and object detection …

Bongard in Wonderland: Visual Puzzles that Still Make AI Go Mad?

A Wüst, T Tobiasch, L Helff, DS Dhami… - arXiv preprint arXiv …, 2024 - arxiv.org
Recently, newly developed Vision-Language Models (VLMs), such as OpenAI's GPT-4o,
have emerged, seemingly demonstrating advanced reasoning capabilities across text and …

Image-to-Text Logic Jailbreak: Your Imagination Can Help You Do Anything

X Zou, K Li, Y Chen - arXiv preprint arXiv:2407.02534, 2024 - arxiv.org
Large Visual Language Models (VLMs) such as GPT-4V have achieved remarkable
success in generating comprehensive and nuanced responses. Researchers have …

VisFactor: Benchmarking Fundamental Visual Cognition in Multimodal Large Language Models

JT Huang, D Dai, JY Huang, Y Yuan, X Liu… - arXiv preprint arXiv …, 2025 - arxiv.org
Multimodal Large Language Models (MLLMs) have demonstrated remarkable
advancements in multimodal understanding; however, their fundamental visual cognitive …

Visual Scratchpads: Enabling Global Reasoning in Vision

A Lotfi, E Fini, S Bengio, M Nabi, E Abbe - arXiv preprint arXiv:2410.08165, 2024 - arxiv.org
Modern vision models have achieved remarkable success in benchmarks where local
features provide critical information about the target. There is now a growing interest in …

Towards Learning to Reason: Comparing LLMs with Neuro-Symbolic on Arithmetic Relations in Abstract Reasoning

M Hersche, G Camposampiero, R Wattenhofer… - arXiv preprint arXiv …, 2024 - arxiv.org
This work compares large language models (LLMs) and neuro-symbolic approaches in
solving Raven's progressive matrices (RPM), a visual abstract reasoning test that involves …

Benchmarking Visual Cognition of Multimodal LLMs via Matrix Reasoning

X Cao, B Lai, W Ye, Y Ma, J Heintz, M Huang, J Chen… - openreview.net
Recently, Multimodal Large Language Models (MLLMs) and Vision Language Models
(VLMs) have shown great promise in language-guided perceptual tasks such as recognition …