arXiv:2603.10377v2 Announce Type: replace-cross
Abstract: Sparse autoencoders can localize where concepts live in language models, but not how they interact during multi-step reasoning. We propose Causal Concept Graphs (CCG): a directed acyclic graph over sparse, interpretable latent features, where edges capture learned causal dependencies between concepts. We combine task-conditioned sparse autoencoders for concept discovery with DAGMA-style differentiable structure learning for graph recovery and introduce the Causal Fidelity Score (CFS) to evaluate whether graph-guided interventions induce larger downstream effects than random ones. On ARC-Challenge, StrategyQA, and LogiQA with GPT-2 Medium, across five seeds ($n=15$ paired runs), CCG achieves $\mathrm{CFS}=5.654\pm0.625$, outperforming ROME-style tracing ($3.382\pm0.233$), SAE-only ranking ($2.479\pm0.196$), and a random baseline ($1.032\pm0.034$), with $p<0.0001$ after Bonferroni correction. Learned graphs are sparse (5-6% edge density), domain-specific, and stable across seeds.
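The abstract names two components without giving formulas: a DAGMA-style acyclicity penalty for structure learning, and the CFS metric. A minimal sketch of both, with two caveats: the acyclicity function is the standard log-det characterization from DAGMA, while the CFS formula is not stated in the abstract, so the version below is an assumption (the ratio of mean graph-guided intervention effect to mean random-intervention effect, so that a value near 1 means no better than random):

```python
import numpy as np

def dagma_h(W: np.ndarray, s: float = 1.0) -> float:
    """DAGMA acyclicity function: h(W) = -log det(sI - W∘W) + d·log(s).
    h(W) = 0 iff the weighted adjacency matrix W encodes a DAG."""
    d = W.shape[0]
    M = s * np.eye(d) - W * W  # Hadamard square removes sign of edge weights
    _, logabsdet = np.linalg.slogdet(M)
    return -logabsdet + d * np.log(s)

def causal_fidelity_score(guided_effects, random_effects) -> float:
    """Hypothetical CFS: ratio of mean downstream effect under graph-guided
    interventions to the mean effect under random interventions.
    NOTE: assumed form; the paper's exact definition is not in the abstract."""
    return float(np.mean(guided_effects) / np.mean(random_effects))

# Acyclic (upper-triangular) adjacency → h ≈ 0; a 2-cycle → h > 0.
W_dag = np.array([[0.0, 0.5], [0.0, 0.0]])
W_cyc = np.array([[0.0, 0.5], [0.5, 0.0]])
print(dagma_h(W_dag), dagma_h(W_cyc))
```

Under this reading, the reported $\mathrm{CFS}\approx5.65$ would mean graph-guided interventions moved downstream behavior roughly 5.65 times as much as random ones.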
