arXiv:2505.06518v3 Announce Type: replace
Abstract: In many real-world planning tasks, agents must tackle uncertainty about the environment’s state and variability in the outcomes induced by stochastic dynamics and rewards. Motivated by recent progress in world model approaches, where latent models approximate beliefs and support planning, we extend Distributional Reinforcement Learning (DistRL), which models the entire return distribution for fully observable domains, to Partially Observable Markov Decision Processes (POMDPs). Concretely, we introduce new distributional Bellman operators for partial observability and prove their convergence under the supremum p-Wasserstein metric. We also propose a finite representation of these return distributions via psi-vectors, generalizing the classical alpha-vectors in POMDP solvers. Building on this, we develop Distributional Point-Based Value Iteration (DPBVI), which integrates psi-vectors into a standard point-based backup procedure, bridging DistRL and POMDP planning. Our experiments demonstrate that DPBVI recovers classical Point-Based Value Iteration (PBVI) in the risk-neutral case, validating the distributional extension.

Subscribe for Updates

Copyright 2025 dijee Intelligence Ltd.   dijee Intelligence Ltd. is a private limited company registered in England and Wales at Media House, Sopers Road, Cuffley, Hertfordshire, EN6 4RY, UK registration number 16808844