arXiv:2408.11513v2 Announce Type: replace-cross
Abstract: This paper focuses on learning a Constrained Markov Decision Process (CMDP) via general parameterized policies. We propose a Primal-Dual based Regularized Accelerated Natural Policy Gradient (PDR-ANPG) algorithm that uses entropy and quadratic regularizers to reach this goal. For parameterized policy classes with a transferred compatibility approximation error, $\epsilon_{\mathrm{bias}}$, PDR-ANPG achieves a last-iterate $\epsilon$ optimality gap and $\epsilon$ constraint violation with a sample complexity of $\tilde{\mathcal{O}}(\epsilon^{-2}\min\{\epsilon^{-2},\epsilon_{\mathrm{bias}}^{-\frac{1}{3}}\})$. If the class is incomplete ($\epsilon_{\mathrm{bias}}>0$), then the sample complexity reduces to $\tilde{\mathcal{O}}(\epsilon^{-2})$ for $\epsilon<(\epsilon_{\mathrm{bias}})^{\frac{1}{6}}$. Moreover, for complete policies with $\epsilon_{\mathrm{bias}}=0$, our algorithm achieves a last-iterate $\epsilon$ optimality gap and $\epsilon$ constraint violation with $\tilde{\mathcal{O}}(\epsilon^{-4})$ sample complexity. It is a significant improvement over the state-of-the-art last-iterate guarantees of general parameterized CMDPs.
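The two regimes quoted in the abstract can be read off by comparing the two arguments of the minimum. The following is a short worked comparison, not the paper's derivation; it assumes $\epsilon_{\mathrm{bias}}$ is a fixed constant of the policy class (so it does not shrink with $\epsilon$) and uses the convention $\epsilon_{\mathrm{bias}}^{-\frac{1}{3}}=\infty$ when $\epsilon_{\mathrm{bias}}=0$.

$$
\epsilon^{-2}\le\epsilon_{\mathrm{bias}}^{-\frac{1}{3}}
\iff \epsilon^{2}\ge\epsilon_{\mathrm{bias}}^{\frac{1}{3}}
\iff \epsilon\ge(\epsilon_{\mathrm{bias}})^{\frac{1}{6}},
$$

$$
\epsilon^{-2}\min\{\epsilon^{-2},\epsilon_{\mathrm{bias}}^{-\frac{1}{3}}\}
=
\begin{cases}
\epsilon^{-4}, & \epsilon\ge(\epsilon_{\mathrm{bias}})^{\frac{1}{6}}\ \text{(always the case when } \epsilon_{\mathrm{bias}}=0\text{)},\\
\epsilon^{-2}\,\epsilon_{\mathrm{bias}}^{-\frac{1}{3}}, & \epsilon<(\epsilon_{\mathrm{bias}})^{\frac{1}{6}}.
\end{cases}
$$

Under that assumption, the second case is $\tilde{\mathcal{O}}(\epsilon^{-2})$ in its dependence on $\epsilon$, and the first case recovers the $\tilde{\mathcal{O}}(\epsilon^{-4})$ rate stated for complete policy classes.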
A blueprint for using AI to strengthen democracy
Every few centuries, changes in how information moves reshape how societies govern themselves. The printing press spread vernacular literacy, helping give rise to the Reformation
