Thinking Wrong in Silence: Backdoor Attacks on Continuous Latent Reasoning

arXiv:2604.00770v1 Announce Type: cross
Abstract: A new generation of language models reasons entirely in continuous hidden states, producing no tokens and leaving
no audit trail. We show that this silence creates a fundamentally new attack surface. ThoughtSteer perturbs a
single embedding vector at the input layer; the model’s own multi-pass reasoning amplifies this perturbation into a
hijacked latent trajectory that reliably produces the attacker’s chosen answer, while remaining structurally
invisible to every token-level defense. Across two architectures (Coconut and SimCoT), three reasoning benchmarks,
and model scales from 124M to 3B parameters, ThoughtSteer achieves ≥99% attack success rate with near-baseline
clean accuracy, transfers to held-out benchmarks without retraining (94-100%), evades all five evaluated active
defenses, and survives 25 epochs of clean fine-tuning. We trace these results to a unifying mechanism: Neural
Collapse in the latent space pulls triggered representations onto a tight geometric attractor, explaining both why
defenses fail and why any effective backdoor must leave a linearly separable signature (probe AUC ≥ 0.999). Yet a
striking paradox emerges: individual latent vectors still encode the correct answer even as the model outputs the
wrong one. The adversarial information is not in any single vector but in the collective trajectory, establishing
backdoor perturbations as a new lens for mechanistic interpretability of continuous reasoning. Code and checkpoints
are available.
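
To make the "linearly separable signature" claim concrete, the following is a minimal sketch on synthetic data, not the authors' code or probe. It assumes clean latent vectors are isotropic Gaussian noise and that triggered latent vectors collapse onto a tight cluster around a single attractor point. Every name and constant here (make_latents, attractor_norm, spread, the 64-dimensional space) is hypothetical, and the difference-of-means probe is a stand-in for whatever linear probe the paper actually uses.

```python
import random


def make_latents(n, dim, rng, attractor_norm=6.0, spread=0.1):
    """Build synthetic stand-ins for latent vectors (not real model states)."""
    # Assumption: clean latents are isotropic unit-variance Gaussian noise.
    clean = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]
    # Assumption: triggered latents sit in a tight ball around one attractor.
    direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = sum(x * x for x in direction) ** 0.5
    attractor = [attractor_norm * x / norm for x in direction]
    triggered = [[a + rng.gauss(0.0, spread) for a in attractor] for _ in range(n)]
    return clean, triggered


def mean_vector(rows):
    """Coordinate-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def fit_probe(clean, triggered):
    """Difference-of-means linear probe, normalized to unit length."""
    w = [t - c for t, c in zip(mean_vector(triggered), mean_vector(clean))]
    norm = sum(x * x for x in w) ** 0.5
    return [x / norm for x in w]


def score(rows, w):
    """Project each vector onto the probe direction."""
    return [sum(x * wi for x, wi in zip(row, w)) for row in rows]


def auc(neg_scores, pos_scores):
    """Chance that a random positive outscores a random negative (ties count half)."""
    wins = sum((p > q) + 0.5 * (p == q) for p in pos_scores for q in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))


rng = random.Random(0)
clean, triggered = make_latents(400, 64, rng)

# Fit on the first half, evaluate on the held-out second half.
w = fit_probe(clean[:200], triggered[:200])
probe_auc = auc(score(clean[200:], w), score(triggered[200:], w))
print(f"held-out probe AUC: {probe_auc:.3f}")
```

Under these assumptions a single direction is enough to separate the two groups almost perfectly, which is the geometric intuition behind the reported probe AUC. The sketch says nothing about the paradox noted in the abstract, where individual latent vectors still encode the correct answer; it only illustrates why a collapsed cluster is easy for a linear probe to detect.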
