• Home
  • Uncategorized
  • Recursive Entropic Risk Optimization in Discounted MDPs: Sample Complexity Bounds with a Generative Model

arXiv:2506.00286v3 Announce Type: replace-cross
Abstract: We study risk-sensitive reinforcement learning in finite discounted MDPs with recursive entropic risk measures (ERM), where the risk parameter $beta neq 0$ controls the agent’s risk attitude: $beta>0$ for risk-averse and $beta<0$ for risk-seeking behavior. A generative model of the MDP is assumed to be available. Our focus is on the sample complexities of learning the optimal state-action value function (value learning) and an optimal policy (policy learning) under recursive ERM. We introduce a model-based algorithm, called Model-Based ERM $Q$-Value Iteration (MB-RS-QVI), and derive PAC-type bounds on its sample complexity for both value and policy learning. Both PAC bounds scale exponentially with $|beta|/(1-gamma)$, where $gamma$ is the discount factor. We also establish corresponding lower bounds for both value and policy learning, showing that exponential dependence on $|beta|/(1-gamma)$ is unavoidable in the worst case. The bounds are tight in the number of states and actions ($S$ and $A$), providing the first rigorous sample complexity guarantees for recursive ERM across both risk-averse and risk-seeking regimes.

Subscribe for Updates

Copyright 2025 dijee Intelligence Ltd.   dijee Intelligence Ltd. is a private limited company registered in England and Wales at Media House, Sopers Road, Cuffley, Hertfordshire, EN6 4RY, UK registration number 16808844