arXiv:2603.10219v1 Announce Type: cross
Abstract: We study a continuous-time diffusion approximation of policy gradient for $k$-armed stochastic bandits. We prove that with a learning rate $\eta = O(\Delta^2/\log(n))$ the regret is $O(k \log(k) \log(n) / \eta)$, where $n$ is the horizon and $\Delta$ the minimum gap. Moreover, we construct an instance with only logarithmically many arms for which the regret is linear unless $\eta = O(\Delta^2)$.
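The discrete-time process whose diffusion limit the abstract studies can be sketched as a softmax policy-gradient (REINFORCE) learner on a $k$-armed Bernoulli bandit, with the learning rate set at the abstract's $\eta = O(\Delta^2/\log(n))$ scale. This is an illustrative reconstruction, not the paper's construction: the arm means, horizon, and update rule below are assumptions.

```python
import numpy as np

# Hypothetical sketch: softmax policy gradient on a k-armed Bernoulli
# bandit. The paper analyzes a continuous-time diffusion approximation
# of this kind of dynamics; all concrete values here are illustrative.

rng = np.random.default_rng(0)

k = 3
means = np.array([0.9, 0.6, 0.5])   # arm means (assumed); Delta = 0.9 - 0.6
Delta = 0.3                         # minimum gap
n = 5000                            # horizon
eta = Delta**2 / np.log(n)          # learning-rate scale from the abstract

theta = np.zeros(k)                 # softmax logits, one per arm
regret = 0.0
for t in range(n):
    p = np.exp(theta - theta.max()) # numerically stable softmax
    p /= p.sum()
    a = rng.choice(k, p=p)          # sample an arm from the policy
    r = rng.binomial(1, means[a])   # Bernoulli reward
    grad = -p                       # REINFORCE: grad log pi(a) = e_a - p
    grad[a] += 1.0
    theta += eta * r * grad
    regret += means.max() - means[a]

print(f"empirical regret after n={n} steps: {regret:.1f}")
```

The lower-bound statement in the abstract says that a larger step size, $\eta$ exceeding $O(\Delta^2)$, can lock such dynamics onto a suboptimal arm and make the regret grow linearly in $n$.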
