arXiv:2604.19781v1 Announce Type: cross
Abstract: Automated scoring of student work at scale requires balancing accuracy against cost and latency. In “cascade” systems, small language models (LMs) handle easier scoring tasks while escalating harder ones to larger LMs — but the challenge is determining which cases to escalate. We explore verbalized confidence — asking the LM to state a numerical confidence alongside its prediction — as a routing signal. Using 2,100 expert-scored decisions from student-AI math conversations, we evaluate cascade systems built from GPT-5.4, Claude 4.5+, and Gemini 3.1 model pairs. We find that: (1) confidence discrimination varies widely across small LMs, with the best achieving AUROC 0.857 and the worst producing a near-degenerate confidence distribution; (2) confidence tracks human scoring difficulty, with lower LM confidence where annotators disagreed and took longer to score; (3) the best cascade approached large-LM accuracy (kappa 0.802 vs. 0.819) at 76% lower cost and 61% lower latency. Confidence discrimination is the bottleneck: the two small LMs with meaningful confidence variance yielded cascades with no statistically detectable kappa loss, while the third — whose confidence was near-degenerate — could not close the accuracy gap regardless of threshold. Small LMs with strong discrimination let practitioners trade cost for accuracy along the frontier; those without it do not.
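The routing rule the abstract describes reduces to a confidence threshold: keep the small LM's score when its stated confidence is high, escalate to the large LM otherwise, and measure discrimination as the AUROC of confidence against small-LM correctness. A minimal sketch of that logic follows, under stated assumptions: the call_small_lm and call_large_lm placeholders, the 0.8 threshold, and the use of scikit-learn for AUROC are illustrative choices, not the paper's implementation.

    # Sketch of verbalized-confidence cascade routing (illustrative, not the
    # paper's code): call_small_lm / call_large_lm and the threshold are assumed.
    from sklearn.metrics import roc_auc_score

    def call_small_lm(item: str) -> tuple[int, float]:
        """Placeholder: return (predicted score, verbalized confidence in [0, 1])."""
        raise NotImplementedError

    def call_large_lm(item: str) -> int:
        """Placeholder: return the large LM's predicted score."""
        raise NotImplementedError

    def cascade_score(item: str, threshold: float = 0.8) -> tuple[int, bool]:
        """Keep the small LM's score when its stated confidence clears the
        threshold; otherwise escalate the item to the large LM."""
        score, confidence = call_small_lm(item)
        if confidence >= threshold:
            return score, False           # handled by the small LM
        return call_large_lm(item), True  # escalated

    def confidence_auroc(items, expert_scores) -> float:
        """Discrimination check from finding (1): AUROC of the small LM's
        confidence against whether its prediction matched the expert score."""
        preds_confs = [call_small_lm(item) for item in items]
        correct = [int(pred == expert) for (pred, _), expert in zip(preds_confs, expert_scores)]
        confs = [conf for _, conf in preds_confs]
        return roc_auc_score(correct, confs)

Sweeping the threshold trades escalation rate (and hence cost and latency) against agreement with experts; with a near-degenerate confidence distribution, every threshold routes almost everything the same way, which is why the third small LM could not close the accuracy gap.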
Cognitive Alignment At No Cost: Inducing Human Attention Biases For Interpretable Vision Transformers
arXiv:2604.20027v1 Announce Type: cross
Abstract: For state-of-the-art image understanding, Vision Transformers (ViTs) have become the standard architecture, but their processing diverges substantially from human attentional

