arXiv:2601.17332v1 Announce Type: new
Abstract: The high cost of agentic workflows in formal mathematics hinders large-scale data synthesis, exacerbating the scarcity of open-source corpora. To address this, we introduce TheoremForge, a cost-effective formal data synthesis pipeline that decomposes the formalization process into five sub-tasks: statement formalization, proof generation, premise selection, proof correction, and proof sketching. By implementing a Decoupled Extraction Strategy, the workflow recovers valid training signals from globally failed trajectories, effectively reclaiming otherwise wasted computation. Experiments on a 2,000-problem benchmark demonstrate that TheoremForge achieves a Verified Rate of 12.6%, surpassing the 8.6% baseline, at an average cost of only $0.481 per successful trajectory using Gemini-3-Flash. Crucially, our strategy increases data yield by 1.6× for proof generation compared to standard filtering. These results establish TheoremForge as a scalable framework for constructing a data flywheel to train future expert models. Our code is available at https://github.com/timechess/TheoremForge.
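
The decoupled extraction idea described above can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: all names (`Step`, `extract_training_signals`) are hypothetical, and it assumes each sub-task step carries its own local verification result (e.g. whether the produced Lean fragment compiles), so verified steps can be harvested even when the trajectory as a whole fails.

```python
from dataclasses import dataclass

@dataclass
class Step:
    task: str        # sub-task name, e.g. "statement_formalization", "proof_generation"
    prompt: str      # input given to the model for this sub-task
    output: str      # model output for this sub-task
    verified: bool   # did this step pass its local check?

def extract_training_signals(trajectory: list[Step]) -> list[dict]:
    """Keep every locally verified step, regardless of whether the
    trajectory as a whole produced a complete verified proof."""
    return [
        {"task": s.task, "prompt": s.prompt, "completion": s.output}
        for s in trajectory
        if s.verified
    ]

# A trajectory that fails globally (no final verified proof) still yields data:
traj = [
    Step("statement_formalization", "Formalize: ...", "theorem foo ...", True),
    Step("proof_generation", "Prove: ...", "by simp ...", False),
]
print(len(extract_training_signals(traj)))  # 1 verified step recovered
```

Under standard whole-trajectory filtering, this trajectory would be discarded entirely; decoupled extraction instead salvages the verified statement-formalization step, which is the mechanism behind the reported 1.6× yield increase.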




