Behavior change beyond intervention: an activity-theoretical perspective on human-centered design of personal health technology

IntroductionModern personal technologies, such as smartphone apps with artificial intelligence (AI) capabilities, have a significant potential for helping people make necessary changes in their behavior

A data-centric perspective on designing AI foundation models for healthcare

Post Content

Correction: AI-Enabled Wearables for Motor Function Assessment and Rehabilitation in Parkinson Disease: Scoping Review

Post Content

Layer-wise MoE Routing Locality under Shared-Prefix Code Generation: Token-Identity Decomposition and Compile-Equivalent Fork Redundancy

arXiv:2604.17182v1 Announce Type: cross Abstract: In LLM-based code generation, multiple code candidates are often generated in parallel from the same prompt — for example, in

HQA-VLAttack: Towards High Quality Adversarial Attack on Vision-Language Pre-Trained Models

arXiv:2604.16499v1 Announce Type: cross Abstract: Black-box adversarial attack on vision-language pre-trained models is a practical and challenging task, as text and image perturbations need to

Different Paths to Harmful Compliance: Behavioral Side Effects and Mechanistic Divergence Across LLM Jailbreaks

April 21, 2026

arXiv:2604.18510v1 Announce Type: cross
Abstract: Open-weight language models can be rendered unsafe through several distinct interventions, but the resulting models may differ substantially in capabilities, behavioral profile, and internal failure mode. We study behavioral and mechanistic properties of jailbroken models across three unsafe routes: harmful supervised fine-tuning (SFT), harmful reinforcement learning with verifiable rewards (RLVR), and refusal-suppressing abliteration. All three routes achieve near-ceiling harmful compliance, but they diverge once we move beyond direct harmfulness. RLVR-jailbroken models show minimal degradation and preserve explicit harm recognition in a structured self-audit: they are able to identify harmful prompts and describe how a safe LLM should respond, yet they comply with the harmful request. With RLVR, harmful behavior is strongly suppressed by a reflective safety scaffold: when a harmful prompt is prepended with an instruction to reflect on safety standards, harmful behavior drops close to the baseline. Category-specific RLVR jailbreaks generalize broadly across harmfulness domains. Models jailbroken with SFT show the largest collapse in explicit safety judgments, the highest behavioral drift, and a substantial capability loss on standard benchmarks. Abliteration is family-dependent in both self-audit and response to a reflective safety scaffold. Mechanistic and repair analyses further separate the routes: abliteration is consistent with localized refusal-feature deletion, RLVR with preserved safety geometry but retargeted policy behavior, and SFT with broader distributed drift. Targeted repair partially recovers RLVR-jailbroken models, but has little effect on SFT-jailbroken models. Together, these results show that jailbreaks can produce vastly different properties despite similar harmfulness, with models jailbroken via RLVR showing remarkable similarity to the base model.

Subscribe for Updates

Copyright 2025 dijee Intelligence Ltd. dijee Intelligence Ltd. is a private limited company registered in England and Wales at Media House, Sopers Road, Cuffley, Hertfordshire, EN6 4RY, UK registration number 16808844