arXiv:2603.22816v2 Announce Type: replace-cross
Abstract: Language models increasingly “show their work” by writing step-by-step reasoning before answering. But are these reasoning steps genuinely used, or decorative narratives generated after the model has already decided? We introduce step-level faithfulness evaluation: removing one reasoning sentence at a time and checking whether the answer changes. The method requires only API access and costs $1-2 per model per task.
Evaluating 13 frontier models (GPT-5.4, Claude Opus, DeepSeek-V3.2, DeepSeek-R1, Gemini 2.5 Pro, MiniMax-M2.5, Kimi-K2.5, and others) across six domains (sentiment, mathematics, topic classification, medical QA, commonsense, science; N=376-500 each), we find that reasoning falls into three modes, not the binary faithful/unfaithful split of prior work. In “genuine reasoning,” steps matter and CoT is essential (MiniMax: 37% necessity, +69pp from CoT). In “scaffolding,” CoT helps but steps are interchangeable (Kimi on math: 1% necessity, +94pp from CoT). In “decoration,” CoT adds nothing (DeepSeek-V3.2: 11% necessity, -1pp from CoT).
The DeepSeek family provides causal evidence: R1 reasoning models show 91-93% necessity on math versus 4% for V3.2. Because both models come from the same organisation, the difference shows that the training objective determines faithfulness. A novel shuffled-CoT mechanistic baseline confirms that reasoning-trained models semantically process their steps (7-19pp attention gap) while standard models attend positionally. We also discover “output rigidity”: models that shortcut internally also refuse to explain externally, a blind spot for explanation-based evaluation.
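
A minimal Python sketch of the leave-one-out procedure described in the abstract, assuming that “necessity” is the fraction of reasoning steps whose removal changes the final answer; ask_model is a hypothetical API wrapper standing in for whatever client the authors used, not their released code.

    # Step-level faithfulness via leave-one-out CoT ablation (illustrative sketch).
    # ask_model(question, cot_steps) is a hypothetical function that sends the
    # question plus the given chain-of-thought steps to the model and returns
    # its final answer as a string.
    from typing import Callable, List

    def step_necessity(question: str,
                       cot_steps: List[str],
                       ask_model: Callable[[str, List[str]], str]) -> float:
        """Fraction of reasoning steps whose removal changes the model's answer."""
        baseline = ask_model(question, cot_steps)          # answer with the full chain of thought
        changed = 0
        for i in range(len(cot_steps)):
            ablated = cot_steps[:i] + cot_steps[i + 1:]    # remove exactly one step
            if ask_model(question, ablated) != baseline:   # does the answer flip?
                changed += 1
        return changed / len(cot_steps) if cot_steps else 0.0
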
Assessing nurses’ attitudes toward artificial intelligence in Kazakhstan: psychometric validation of a nine-item scale
Background: Artificial intelligence (AI) is increasingly integrated into healthcare, yet the attitudes and knowledge of nurses, who are the key mediators of AI implementation, remain underexplored.



