How Reasoning Evolves from Post-Training Data: An Empirical Study Using Chess

arXiv:2604.05134v2
Abstract: We study how reasoning evolves in a language model, from supervised fine-tuning (SFT) to reinforcement learning (RL), by analyzing how a set of theoretically-inspired datasets influences language model performance in chess. We find that fine-tuning a model to directly predict the best move leads to effective RL and the strongest downstream performance; however, the RL stage elicits unfaithful reasoning (reasoning inconsistent with the chosen move). Alternatively, training on multi-move trajectories yields comparable downstream performance with faithful reasoning and more stable RL. We analyze multiple qualitative and quantitative measures and highlight how these evolve from SFT through RL; we find several SFT-checkpoint metrics, spanning evaluation performance, hallucination rates, and reasoning quality, to be predictive of post-RL model performance. Finally, we ground our results with an experiment measuring chess information density in our custom datasets. We release models as well as training data, evaluations, and code that allowed us to surpass leading open-source reasoning models in chess with a 7B-parameter model. Code, models, and data are available at https://github.com/lucasdino/lang-chess.
