arXiv:2604.25907v1 Announce Type: cross
Abstract: Adapting reasoning models to new tasks during post-training with only output-level supervision stalls under reinforcement learning from verifiable rewards (RLVR) when the initial success probability $p_0$ is small. Using the Tsallis $q$-logarithm, we define a loss family $J_Q$ that interpolates between RLVR (at $q=0$, the exploitation pole) and the log-marginal-likelihood over latent trajectories (at $q=1$, the density-estimation pole). All members share the same per-example gradient direction, differing only by a scalar amplification $P_\theta^{-q}$ that reweights each instance independently of the learning rate. This amplification is the mechanism that addresses cold-start stalling: under gradient flow, the exploitation pole requires $\Omega\big(\tfrac{1}{p_0}\big)$ time to escape cold start, while the density-estimation pole escapes in $\Theta\big(\log\tfrac{1}{p_0}\big)$; intermediate $q$ trades escape speed against noise memorization. Because $P_\theta$ is intractable, we derive two Monte Carlo estimators from the two factorizations of the gradient: Gradient-Amplified RL (GARL) samples from the prior and amplifies the RL gradient, and Posterior-Attenuated Fine-Tuning (PAFT) importance-resamples from the posterior and runs standard SFT. Both have bias $O\big(\tfrac{q}{M} P_\theta^{q+1}\big)$; GARL has lower variance, while PAFT has semantically coherent gradients. On FinQA, HotPotQA, and MuSiQue, GARL at $q=0.75$ substantially mitigates cold-start stalling, escaping cold start where GRPO fails entirely. In the warm-start setting, GARL at low $q$ dominates on FinQA, where training is stable; on HotPotQA and MuSiQue, GARL destabilizes during training, and PAFT at $q=0.75$ provides stable gradients (best overall on HotPotQA at 47.9 maj@16, $+14.4$ over GRPO).
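The abstract does not spell out the loss family; as a minimal sketch, assuming $J_q$ applies the standard Tsallis $q$-logarithm to the marginal success probability $P_\theta$, the interpolation and the shared gradient direction with scalar amplification $P_\theta^{-q}$ would take the form:

```latex
% Sketch under the assumption that J_q = ln_q(P_theta); not quoted from the paper.
\[
  \ln_q(x) \;=\; \frac{x^{1-q} - 1}{1 - q} \quad (q \neq 1),
  \qquad
  \ln_1(x) \;=\; \log x,
\]
\[
  J_q(\theta) \;=\; \ln_q\!\big(P_\theta\big)
  \quad\Longrightarrow\quad
  \nabla_\theta J_q \;=\; P_\theta^{-q}\,\nabla_\theta P_\theta .
\]
% q = 0 gives ln_0(x) = x - 1, i.e. the RLVR (expected-reward) objective up to a constant;
% q = 1 gives the log-marginal-likelihood; every q shares the direction of grad P_theta,
% rescaled per example by the amplification factor P_theta^{-q}.
```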
Disclosure in the era of generative artificial intelligence
Generative artificial intelligence (AI) has rapidly become embedded in academic writing, assisting with tasks ranging from language editing to drafting text and producing evidence. Despite

