arXiv:2603.00523v2 Announce Type: replace-cross
Abstract: Every mechanistic circuit carries an invisible
asterisk: it reflects not just the model’s
computation, but the analyst’s choice of
pruning threshold. Change that choice and the
circuit changes, yet current practice treats a
single pruned subgraph as ground truth with
no way to distinguish robust structure from
threshold artifacts. We introduce CIRCUS,
which reframes circuit discovery as a problem
of uncertainty over explanations. CIRCUS
prunes one attribution graph under B
configurations, assigns each edge an empirical
inclusion frequency s(e) in [0,1] measuring
how robustly it survives across the
configuration family, and extracts a consensus
circuit of edges present in every view. This
yields a principled core/contingent/noise
decomposition (analogous to posterior
model-inclusion indicators in Bayesian
variable selection) that separates robust
structure from threshold-sensitive artifacts,
with negligible overhead. On Gemma-2-2B and
Llama-3.2-1B, consensus circuits are 40x
smaller than the union of all configurations
while retaining comparable influence-flow
explanatory power, consistently outperform
influence-ranked and random baselines, and are
confirmed causally relevant by activation
patching.
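
The scoring step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each pruned view is available as a Python set of edges, and the function names, the `all_edges` argument, and the `noise_cutoff` parameter are hypothetical. The abstract defines the consensus circuit as the edges present in every view (s(e) = 1); how contingent edges are separated from noise is not stated there, so the cutoff below is an assumption.

```python
from collections import Counter


def inclusion_frequencies(pruned_views):
    """Return s(e) = (number of views containing e) / B for every edge seen.

    pruned_views: list of B edge sets, one per pruning configuration.
    Edges can be any hashable value, e.g. (source, target) tuples.
    """
    B = len(pruned_views)
    if B == 0:
        raise ValueError("need at least one pruning configuration")
    counts = Counter(edge for view in pruned_views for edge in set(view))
    return {edge: n / B for edge, n in counts.items()}


def consensus_circuit(pruned_views):
    """Edges that survive in every view, i.e. the intersection of all views."""
    if not pruned_views:
        raise ValueError("need at least one pruning configuration")
    return set.intersection(*(set(view) for view in pruned_views))


def decompose(pruned_views, all_edges, noise_cutoff=0.0):
    """Split edges into core / contingent / noise by inclusion frequency.

    Assumption (not from the abstract): an edge is 'noise' when
    s(e) <= noise_cutoff, 'core' when it appears in every view,
    and 'contingent' otherwise.
    """
    s = inclusion_frequencies(pruned_views)
    core = consensus_circuit(pruned_views)
    noise = {e for e in all_edges if s.get(e, 0.0) <= noise_cutoff}
    contingent = set(all_edges) - core - noise
    return core, contingent, noise


# Toy example with B = 3 hypothetical pruning configurations.
views = [
    {("a", "b"), ("b", "c"), ("a", "c")},
    {("a", "b"), ("b", "c")},
    {("a", "b"), ("c", "d")},
]
all_edges = set().union(*views) | {("d", "e")}  # ("d", "e") is always pruned

scores = inclusion_frequencies(views)
core, contingent, noise = decompose(views, all_edges)
```

In this toy case only `("a", "b")` appears in all three views, so it forms the consensus circuit, `("d", "e")` never survives pruning and falls into the noise set, and the remaining three edges are contingent. The extra cost beyond the B pruning passes is a single count over the edge sets.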
