Beyond Content Safety: Real-Time Monitoring for Reasoning Vulnerabilities in Large Language Models

arXiv:2603.25412v2
Abstract: Large language models increasingly rely on explicit chain-of-thought reasoning to solve complex tasks, yet the safety of the reasoning process itself remains largely unaddressed. Existing work focuses predominantly on content safety (i.e., detecting harmful, biased, or factually incorrect outputs), while treating the underlying reasoning chain as an opaque intermediate artifact. We argue that reasoning safety constitutes a fundamental security dimension orthogonal to content safety: the requirement that a model’s reasoning trajectory be logically consistent, computationally efficient, and resistant to adversarial manipulation. In this paper, we formalize reasoning safety and introduce a systematic taxonomy of nine unsafe reasoning behaviors. We then conduct a large-scale prevalence study, annotating over 4,000 reasoning chains across benign benchmarks and four state-of-the-art reasoning attacks, empirically demonstrating that all nine error types occur in practice with mechanistically interpretable signatures. To mitigate these threats, we propose the Reasoning Safety Monitor: an external, zero-shot verification framework that runs in parallel with the target LLM. It inspects each reasoning step in real time via a taxonomy-embedded prompt and dispatches an interrupt signal upon detecting unsafe behavior. Extensive evaluations show our monitor achieves up to 87.11% step-level localization accuracy, outperforming hallucination detectors and the best process reward model baselines by a substantial margin. Crucially, the monitor maintains a low false positive rate on correct reasoning paths, operates with negligible latency overhead, and exhibits robust resilience against adaptive adversarial evasion. These findings establish reasoning safety monitoring as a highly feasible and essential component for the secure deployment of large reasoning models.
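
The monitoring loop described in the abstract (inspect each reasoning step, interrupt on unsafe behavior) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The names `Verdict`, `ReasoningInterrupt`, `monitor_steps`, and `toy_judge` are hypothetical, as is the `repeated_step` label; the abstract does not enumerate the nine taxonomy categories, so none are reproduced here. The sketch also checks steps sequentially for simplicity, whereas the paper describes a monitor running in parallel with the target LLM, and the toy judge is a stand-in for the zero-shot, taxonomy-embedded prompt.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List, Optional


@dataclass
class Verdict:
    """Result of judging one reasoning step."""
    unsafe: bool
    label: Optional[str] = None  # hypothetical label, not a category from the paper


class ReasoningInterrupt(Exception):
    """Raised when the monitor flags a reasoning step as unsafe."""

    def __init__(self, step_index: int, verdict: Verdict):
        super().__init__(f"unsafe reasoning at step {step_index}: {verdict.label}")
        self.step_index = step_index
        self.verdict = verdict


# A judge sees the steps accepted so far plus the new step.
# In a real system this would be an LLM call; here it is any callable.
Judge = Callable[[List[str], str], Verdict]


def monitor_steps(steps: Iterable[str], judge: Judge) -> Iterator[str]:
    """Yield each step that passes the judge; raise on the first unsafe one."""
    history: List[str] = []
    for index, step in enumerate(steps):
        verdict = judge(history, step)
        if verdict.unsafe:
            # The interrupt carries the step index, i.e. where the chain went wrong.
            raise ReasoningInterrupt(index, verdict)
        history.append(step)
        yield step


def toy_judge(history: List[str], step: str) -> Verdict:
    """Assumed stand-in judge: flags a step that repeats an earlier one verbatim."""
    if step in history:
        return Verdict(unsafe=True, label="repeated_step")
    return Verdict(unsafe=False)


if __name__ == "__main__":
    chain = ["Let x = 2.", "Then 2x = 4.", "Let x = 2.", "So x = 2."]
    try:
        for accepted in monitor_steps(chain, toy_judge):
            print("ok:", accepted)
    except ReasoningInterrupt as stop:
        print("interrupted at step", stop.step_index)
```

Because the interrupt records the index of the offending step, a wrapper like this reports where a chain failed rather than only that it failed, which is the kind of step-level localization the abstract evaluates.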

