arXiv:2604.14243v1 Announce Type: cross
Abstract: Real-world decision-making systems operate in environments where state transitions depend not only on the agent's actions, but also on \textbf{exogenous factors} outside its control (competing agents, environmental disturbances, or strategic adversaries): formally, $s_{h+1} = f(s_h, a_h, \bar{a}_h) + \omega_h$, where $\bar{a}_h$ is the adversary's (external) action, $a_h$ is the agent's action, and $\omega_h$ is additive noise. Ignoring such factors can yield policies that are optimal in isolation but \textbf{fail catastrophically} in deployment, particularly when safety constraints must be satisfied.
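The transition model above can be simulated directly. The sketch below is a minimal illustration of $s_{h+1} = f(s_h, a_h, \bar{a}_h) + \omega_h$; the linear choice of $f$, the noise scale, and all numeric values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def step(s, a, a_bar, f, noise_std=0.1, rng=None):
    """One transition: s_{h+1} = f(s_h, a_h, a_bar_h) + omega_h,
    with omega_h drawn as zero-mean Gaussian (aleatoric) noise."""
    rng = rng if rng is not None else np.random.default_rng()
    omega = rng.normal(0.0, noise_std, size=np.shape(s))
    return f(s, a, a_bar) + omega

# Toy linear dynamics (an assumption for illustration): the adversary's
# action a_bar pushes the state opposite to the agent's action a.
f = lambda s, a, a_bar: 0.9 * s + a - 0.5 * a_bar

s = np.array([1.0])
s_next = step(s, a=0.2, a_bar=0.1, f=f, rng=np.random.default_rng(0))
```

Seeding the generator makes the aleatoric draw reproducible, which is convenient when comparing agent policies against a fixed adversary.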
Standard Constrained MDP formulations assume the agent is the sole driver of state evolution, an assumption that breaks down in safety-critical settings. Existing robust RL approaches address this via distributional robustness over transition kernels, but they do not explicitly model the \textbf{strategic interaction} between the agent and the exogenous factor, and they rely on strong assumptions about divergence from a known nominal model.
We model the exogenous factor as an \textbf{adversarial policy} $\bar{\pi}$ that co-determines state transitions, and ask how an agent can remain both optimal and safe against such an adversary. \emph{To the best of our knowledge, this is the first work to study safety-constrained RL under explicit adversarial dynamics.} We propose \textbf{Robust Hallucinated Constrained Upper-Confidence RL} (\texttt{RHC-UCRL}), a model-based algorithm that maintains optimism over both agent and adversary policies, explicitly separating epistemic from aleatoric uncertainty. \texttt{RHC-UCRL} achieves sublinear regret and constraint-violation guarantees.
