• Home
  • Uncategorized
  • Minimax Optimal Variance-Aware Regret Bounds for Multinomial Logistic MDPs

arXiv:2605.19768v1 Announce Type: new
Abstract: We study reinforcement learning for episodic Markov Decision Processes (MDPs) whose transitions are modelled by a multinomial logistic (MNL) model. Existing algorithms for MNL mixture MDPs yield a regret of $smashtildeO(dH^2sqrtT)$ (Li et al., 2024), where $d$ is the feature dimension, $H$ the episode length, and $T$ the number of episodes. Inspired by the logistic bandit literature (Abeille et al., 2021; Faury et al., 2022; Boudart et al., 2026), we introduce a problem-dependent constant $barsigma_T leq 1/2$, measuring the normalised average variance of the optimal downstream value function along the learner’s trajectory. We propose an algorithm achieving a regret of $smashtildeO(dH^2barsigma_TsqrtT)$, which recovers the existing bound in the worst case and improves upon it for structured MDPs. For instance, for KL-constrained robust MDPs, $barsigma_T = O(H^-1)$, reducing the horizon dependence by a factor $H$. We further establish a matching $smashOmega(dH^2barsigma_TsqrtT)$ lower bound, proving minimax optimality (up to logarithmic factors) and fully characterising the regret complexity of MNL mixture MDPs for the first time.

Subscribe for Updates

Copyright 2025 dijee Intelligence Ltd.   dijee Intelligence Ltd. is a private limited company registered in England and Wales at Media House, Sopers Road, Cuffley, Hertfordshire, EN6 4RY, UK registration number 16808844