arXiv:2604.05134v2 Announce Type: replace-cross
Abstract: We study how reasoning evolves in a language model, from supervised fine-tuning (SFT) to reinforcement learning (RL), by analyzing how a set of theoretically-inspired datasets influences language model performance in chess. We find that fine-tuning a model to directly predict the best move leads to effective RL and the strongest downstream performance; however, the RL stage elicits unfaithful reasoning (reasoning inconsistent with the chosen move). Alternatively, training on multi-move trajectories yields comparable downstream performance with faithful reasoning and more stable RL. We analyze multiple qualitative and quantitative measures and highlight how these evolve from SFT through RL; we find several SFT-checkpoint metrics (spanning evaluation performance, hallucination rates, and reasoning quality) to be predictive of post-RL model performance. Finally, we ground our results with an experiment measuring chess information density in our custom datasets. We release models as well as training data, evaluations, and code that allowed us to surpass leading open-source reasoning models in chess with a 7B-parameter model. Code, models, and data are available at https://github.com/lucasdino/lang-chess.
Crisis support teams’ technological openness and learning attitudes toward the AI-based virtual patient system crisis support VR
Background: Against the backdrop of escalating global humanitarian crises, innovative didactic simulations are becoming increasingly important. A promising alternative to traditional classroom-based didactics for learning psychological