arXiv:2605.02141v1 Announce Type: cross
Abstract: Kullback-Leibler (KL) regularization is widely used in offline decision-making and offers several benefits, motivating recent work on the sample complexity of offline learning with respect to KL-regularized performance metrics. Nevertheless, the exact sample complexity of KL-regularized offline learning remains largely from fully characterized. In this paper, we study this question in the setting of multi-armed bandits (MABs). We provide a sharp analysis of KL-PCB (Zhao et al., 2026), showing that it achieves a sample complexity of $tildeO(eta SAC^pi^*/epsilon)$ under large regularization $eta = tildeO(epsilon^-1)$, and a sample complexity of $tildeOmega(SAC^pi^*/epsilon^2)$ under small regularization $eta = tildeOmega(epsilon^-1)$, where $eta$ is the regularization parameter, $S$ is the number of contexts, $A$ is the number of arms, $C^pi^*$ policy coverage coefficient at the optimal policy $pi^*$, $epsilon$ is the desired sub-optimality, and $tildeO$ and $tildeOmega$ hide all poly-logarithmic factors. We further provide a pair of sharper sample complexity lower bounds, which matches the upper bounds over the entire range of regularization strengths. Overall, our results provide a nearly complete characterization of offline multi-armed bandits with KL regularization.
Crisis support teams’ technological openness and learning attitudes toward the AI based virtual patient system crisis support VR
BackgroundAgainst the backdrop of escalating global humanitarian crises, innovative didactic simulations are becoming increasingly important. A promising alternative to traditional classroom-based didactics for learning psychological