Knowing without Acting: The Disentangled Geometry of Safety Mechanisms in Large Language Models

arXiv:2603.05773v2
Abstract: Safety alignment is often conceptualized as a monolithic process wherein harmfulness detection automatically triggers refusal. However, the persistence of jailbreak attacks suggests a fundamental mechanistic decoupling. We propose the Disentangled Safety Hypothesis (DSH), positing that safety computation operates on two distinct subspaces: a Recognition Axis ($\mathbf{v}_H$, “Knowing”) and an Execution Axis ($\mathbf{v}_R$, “Acting”). Our geometric analysis reveals a universal “Reflex-to-Dissociation” evolution, where these signals transition from antagonistic entanglement in early layers to structural independence in deep layers. To validate this, we introduce Double-Difference Extraction and Adaptive Causal Steering. Using our curated AmbiguityBench, we demonstrate a causal double dissociation, effectively creating a state of “Knowing without Acting.” Crucially, we leverage this disentanglement to propose the Refusal Erasure Attack (REA), which achieves state-of-the-art attack success rates by surgically lobotomizing the refusal mechanism. Furthermore, we uncover a critical architectural divergence, contrasting the Explicit Semantic Control of Llama3.1 with the Latent Distributed Control of Qwen2.5. The code and dataset are available at https://anonymous.4open.science/r/DSH.
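To make the geometric claim concrete, the following is a minimal sketch on synthetic data of how the relationship between two directions such as $\mathbf{v}_H$ and $\mathbf{v}_R$ could be compared at a given layer. It is not the paper's Double-Difference Extraction, which the abstract does not specify. A plain difference-of-means direction is used as a stand-in, and the helper names (`difference_of_means`, `toy_activations`, `unit`), the dimensionality, and the noise level are all assumptions made for illustration. Reading "antagonistic entanglement" as a strongly negative cosine similarity and "structural independence" as a cosine near zero is likewise one plausible interpretation, not a definition taken from the source.

```python
import math
import random

DIM = 8  # toy hidden size (assumption; real models are far larger)

def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def difference_of_means(pos, neg):
    """Direction from the mean of `neg` to the mean of `pos`.
    Stand-in for the paper's extraction method, which is not given here."""
    mp, mn = mean_vector(pos), mean_vector(neg)
    return [a - b for a, b in zip(mp, mn)]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def unit(i, scale=1.0):
    """Scaled basis vector along coordinate i."""
    return [scale if j == i else 0.0 for j in range(DIM)]

def toy_activations(shift, seed, n=50, noise=0.1):
    """Synthetic 'activations': a fixed shift plus Gaussian noise."""
    rng = random.Random(seed)
    return [[s + rng.gauss(0.0, noise) for s in shift] for _ in range(n)]

baseline = [0.0] * DIM

# Hypothetical early layer: the two signals lie on the same axis with
# opposite sign.
early_h = difference_of_means(toy_activations(unit(0), 1),
                              toy_activations(baseline, 2))
early_r = difference_of_means(toy_activations(unit(0, -1.0), 3),
                              toy_activations(baseline, 4))

# Hypothetical deep layer: the two signals lie on orthogonal axes.
deep_h = difference_of_means(toy_activations(unit(0), 5),
                             toy_activations(baseline, 6))
deep_r = difference_of_means(toy_activations(unit(1), 7),
                             toy_activations(baseline, 8))

early_cos = cosine(early_h, early_r)
deep_cos = cosine(deep_h, deep_r)
print(f"early-layer cosine: {early_cos:.3f}")
print(f"deep-layer cosine:  {deep_cos:.3f}")
```

In this toy setup the early-layer cosine comes out close to -1 and the deep-layer cosine close to 0 by construction. With real model activations the same comparison would be run per layer, and the pattern the abstract describes would have to be observed rather than built in.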

