arXiv:2602.06941v2 Announce Type: replace-cross
Abstract: Large language models can recover mid-generation from task-misaligned activation steering, producing explicit verbal restarts (e.g., “wait, that’s not right”) and continuing on-topic even while the steering perturbation remains active. We term this Endogenous Steering Resistance (ESR). Using sparse autoencoder (SAE) latents to steer model activations, we find that Llama-3.3-70B exhibits explicit ESR at llamaseventyEsrRate%, with smaller models from the Llama-3 and Gemma-2 families showing the explicit form less frequently. Two controls dissociate ESR into a detection event and a sustained-resistance component that conditioning on recent on-topic tokens does not fully explain. We identify numOtdLatents SAE latents through contrastive on-topic/off-topic search; zero-ablating them reduces the multi-attempt rate by multiAttemptReductionPct%, with random-latent and held-out-prompt controls supporting specificity. ESR can also be deliberately enhanced through both meta-prompting and fine-tuning on synthetic self-correction examples. ESR has dual implications for safety: it could harden models against adversarial activation-space manipulation, but may equally interfere with beneficial steering-based interventions, since the model has no way to distinguish the two. Code is available at https://github.com/agencyenterprise/endogenous-steering-resistance.
Within-person modeling of postprandial glucose using multimodal wearable data
The widespread adoption of continuous glucose monitoring (CGM) and wearable sensing technologies has enabled large-scale collection of high-resolution physiological and behavioral data in real-world settings.
