arXiv:2510.26418v4 Announce Type: replace
Abstract: Large Reasoning Models (LRMs) improve task performance through extended inference-time reasoning. Although previous studies suggest that longer reasoning should lead to more robust safety behavior, we find evidence to the contrary: over-extended reasoning can instead be exploited to systematically weaken refusal behavior. We propose Chain-of-Thought Hijacking, a simple yet effective black-box jailbreak attack that induces LRMs to engage in prolonged benign puzzle-solving reasoning, often lasting more than five minutes, before eliciting harmful compliance. Across HarmBench, CoT Hijacking achieves attack success rates of 99%, 94%, 100%, and 94% on Gemini 2.5 Pro, ChatGPT o4 Mini, Grok 3 Mini, and Claude 4 Sonnet, respectively. To understand why this attack succeeds, we conduct activation probing, attention-pattern analysis, and causal interventions on open-source reasoning models. Our results indicate that refusal behavior depends on a low-dimensional safety signal whose expression weakens as reasoning traces grow longer. In particular, extended benign reasoning shifts attention away from harmful intentions and attenuates refusal-related activations, producing what we call refusal dilution. These findings demonstrate that excessively prolonged reasoning can introduce a systematic jailbreak attack surface. We release our evaluation materials to support reproducibility and further research.
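
The dilution intuition can be illustrated with a toy softmax calculation. This is only a sketch under stated assumptions, not the authors' analysis: the function names, the logit values, and the split into a fixed "request" span and a growing "filler" span are all hypothetical. It shows only that, when a fixed set of tokens competes with a growing number of other tokens, the share of attention mass on the fixed set shrinks.

```python
import math

def attention_mass_on_span(span_logits, other_logits):
    """Fraction of softmax attention mass that lands on `span_logits`."""
    all_logits = list(span_logits) + list(other_logits)
    peak = max(all_logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in all_logits]
    return sum(exps[:len(span_logits)]) / sum(exps)

def dilution_curve(n_span=10, span_logit=1.0, filler_logit=0.0,
                   filler_lengths=(0, 100, 1000, 10000)):
    """Attention share of a fixed span as filler tokens are added.

    All values are illustrative assumptions: `n_span` tokens stand in for
    the request, and each filler token stands in for one token of benign
    reasoning with a lower attention logit.
    """
    span = [span_logit] * n_span
    return {n: attention_mass_on_span(span, [filler_logit] * n)
            for n in filler_lengths}

if __name__ == "__main__":
    for n_filler, share in dilution_curve().items():
        print(f"{n_filler:>6} filler tokens -> span attention share {share:.4f}")
```

Even though each span token keeps a higher logit than any filler token, the span's total share falls as the filler grows, which is the qualitative effect the abstract calls refusal dilution. Real attention patterns are learned and layer-dependent, so this toy says nothing about the magnitude of the effect in an actual model.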
