• Home
  • Uncategorized
  • Cognitive Mismatch in Multimodal Large Language Models for Discrete Symbol Understanding

arXiv:2603.18472v2 Announce Type: replace
Abstract: Multimodal large language models (MLLMs) perform strongly on natural images, yet their ability to understand discrete visual symbols remains unclear. We present a multi-domain benchmark spanning language, culture, mathematics, physics and chemistry, organized into three cognitive levels: perception and recognition, combination and reasoning, and association and critical thinking. Across leading MLLMs, we observe a consistent cognitive mismatch. Models frequently underperform on elementary symbol recognition while appearing relatively competent on more complex reasoning tasks. This recognition-reasoning inversion indicates that current systems often compensate with linguistic priors, template retrieval or procedural reasoning instead of robust visual grounding. The pattern is especially clear for sparse, low-redundancy symbols such as handwritten characters, formula graphs, circuit diagrams and chemical structures. These results show that symbolic understanding remains a major bottleneck for multimodal intelligence and motivate training and evaluation schemes that prioritize grounded perception in discrete semantic spaces.

Subscribe for Updates

Copyright 2025 dijee Intelligence Ltd.   dijee Intelligence Ltd. is a private limited company registered in England and Wales at Media House, Sopers Road, Cuffley, Hertfordshire, EN6 4RY, UK registration number 16808844