arXiv:2602.20078v3 Announce Type: replace-cross
Abstract: Scaling cooperative multi-agent reinforcement learning (MARL) is fundamentally limited by cross-agent noise. When agents share a common reward, each agent’s learning signal is computed from a shared return that depends on all agents, so the stochasticity of the other agents enters the signal as cross-agent noise that grows with $N$. Fortunately, many engineering systems, such as cloud computing and power systems, have differentiable analytical models that prescribe efficient system states, providing a new reference beyond noisy shared returns. In this work, we propose Descent-Guided Policy Gradient (DG-PG), a framework that augments policy-gradient updates with a noise-free descent signal derived from differentiable analytical models. We prove that DG-PG reduces policy-gradient estimator variance from $mathcalO(N)$ to $mathcalO(1)$, preserves the equilibria of the cooperative game, and achieves agent-independent sample complexity $widetildemathcalO (1/epsilon)$. On a heterogeneous cloud resource scheduling task with up to 1500 agents, DG-PG converges within 20 episodes on average, while MAPPO and IPPO fail to converge under identical architectures.
Digital health tools and point solutions—pitfalls in population health program measurement
Digital health tools are generally poorly regulated and often lack strong research evidence, posing challenges for purchasers of point solutions such as employer groups and