Measuring and reducing surgical staff stress in a realistic operating room setting using EDA monitoring and smart hearing protection

BackgroundStress is a critical factor in the operating room (OR) and affects both the performance and well-being of surgical staff. Measuring and mitigating this stress

Gradient-specified optimization based on muscle surface mesh and moment arm as an effect-oriented approach of automated musculotendon path modeling

There is more to musculotendon path modeling than aligning a cable to reflect the geometric features of a muscle-tendon unit. From the perspective of simulation

Adaptation to free-living drives loss of beneficial endosymbiosis through metabolic trade-offs

Symbioses are widespread (1) and underpin the function of diverse ecosystems (2-6), but their evolutionary stability is challenging to explain (7,8). Fitness trade-offs between con-trasting

Hippocampal representations differentiate reactive and anticipatory responses during foraging under threat

Adaptive behavior under threat requires balancing reward pursuit against the risk of harm. During approach-avoidance conflict, animals often pause at decision points, but whether these

Frontal Brain Injury Reduces Sensitivity to Reward-Predictive Cues and Remodels the Nucleus Accumbens

Traumatic brain injuries (TBIs) are more than mere lesions and generate a persistent secondary pathology. This, combined with functional reorganization of circuits post-injury, may explain

ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack

April 15, 2026

arXiv:2509.25843v2 Announce Type: replace
Abstract: Large language models (LLMs), despite being safety-aligned, exhibit brittle refusal behaviors that can be circumvented by simple linguistic changes. As tense jailbreaking demonstrates that models refusing harmful requests often comply when rephrased in past tense, a critical generalization gap is revealed in current alignment methods whose underlying mechanisms are poorly understood. In this work, we introduce Activation-Scaling Guard (ASGuard), an insightful, mechanistically-informed framework that surgically mitigates this specific vulnerability. In the first step, we use circuit analysis to identify the specific attention heads causally linked to the targeted jailbreaking such as a tense-changing attack. Second, we train a precise, channel-wise scaling vector to recalibrate the activation of tense vulnerable heads. Lastly, we apply it into a “preventative fine-tuning”, forcing the model to learn a more robust refusal mechanism. Across four LLMs, ASGuard effectively reduces the attack success rate of targeted jailbreaking while preserving general capabilities and minimizing over refusal, achieving a Pareto-optimal balance between safety and utility. Our findings underscore how adversarial suffixes suppress the propagation of the refusal-mediating direction, based on mechanistic analysis. Furthermore, our work showcases how a deep understanding of model internals can be leveraged to develop practical, efficient, and targeted methods for adjusting model behavior, charting a course for more reliable and interpretable AI safety.

Subscribe for Updates

Copyright 2025 dijee Intelligence Ltd. dijee Intelligence Ltd. is a private limited company registered in England and Wales at Media House, Sopers Road, Cuffley, Hertfordshire, EN6 4RY, UK registration number 16808844