Regret and stability guarantees for average‑reward reinforcement learning under nonstationarity

Derive regret or stability bounds for average-reward reinforcement learning applied to portfolio control with Lipschitz rewards in nonstationary environments that satisfy mixing conditions, in order to provide principled guidance on the sample complexity and robustness of the RL layer used in the RL-BHRP framework.
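
A minimal formalization of the target quantity, assuming a standard dynamic-regret criterion (the specific notion is not fixed by the source): with time-varying dynamics P_t and rewards r_t, and with \rho_t^* denoting the optimal long-run average reward of the MDP frozen at time t, the goal is a bound on

\[
\mathrm{Reg}(T) \;=\; \sum_{t=1}^{T} \big( \rho_t^{*} - r_t(s_t, a_t) \big),
\]

typically expressed in terms of the horizon T, the mixing time of the induced state process, the Lipschitz constant of the reward, and a variation budget \sum_{t=1}^{T-1} \lVert P_{t+1} - P_t \rVert quantifying the degree of nonstationarity.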

Background

The RL component of the RL‑BHRP framework employs an average‑reward Markov decision process with a differentiable, factorized softmax policy and a reward that includes returns, transaction costs, and hierarchical risk‑parity penalties. The paper proves a policy‑gradient identity under standard regularity assumptions.
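
For reference, the average-reward objective and the textbook form of such a policy-gradient identity (the paper's exact statement is not reproduced here and may differ in its parameterization) are

\[
\rho(\theta) \;=\; \lim_{T\to\infty} \frac{1}{T}\, \mathbb{E}_{\pi_\theta}\!\Big[ \sum_{t=1}^{T} r(s_t, a_t) \Big],
\qquad
\nabla_\theta \rho(\theta) \;=\; \mathbb{E}_{s \sim d_{\pi_\theta},\, a \sim \pi_\theta(\cdot \mid s)}\!\big[ \nabla_\theta \log \pi_\theta(a \mid s)\, Q_{\pi_\theta}(s, a) \big],
\]

where d_{\pi_\theta} is the stationary state distribution under \pi_\theta and Q_{\pi_\theta} is the differential (bias) action-value function. Any regret analysis would need these objects to remain well defined, and to drift slowly, as the environment changes.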

However, the authors explicitly state that generalization guarantees for the RL layer in nonstationary settings are not available. They propose deriving regret or stability bounds under Lipschitz reward assumptions and mixing conditions to quantify sample complexity and robustness when the environment exhibits distributional shifts.
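
The assumptions named above would typically be formalized as follows; these are standard definitions rather than statements taken from the paper. Writing w for the portfolio-weight (action) vector and (X_t) for the state process,

\[
|r(s, w) - r(s, w')| \;\le\; L \,\lVert w - w' \rVert_1
\quad\text{(Lipschitz reward)},
\qquad
\beta(k) \;\le\; C\,\gamma^{k},\ \ 0 < \gamma < 1
\quad\text{(geometric $\beta$-mixing)},
\]

where \beta(k) is the \beta-mixing coefficient between the past \sigma(X_{-\infty:t}) and the future \sigma(X_{t+k:\infty}). Under such conditions, concentration inequalities for dependent data (e.g., via blocking arguments) apply with an effective sample size discounted by roughly the mixing time, which is the usual route to sample-complexity and stability statements for the RL layer.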

References

While we establish the feasibility of the two-level weight construction and state a policy-gradient identity under standard regularity, several theoretical questions remain open. For the RL layer, generalization guarantees under nonstationarity are not available; deriving regret or stability bounds for average-reward RL with Lipschitz rewards and mixing conditions would provide principled guidance on sample complexity and robustness.

Kang et al., "Optimal Portfolio Construction -- A Reinforcement Learning Embedded Bayesian Hierarchical Risk Parity (RL-BHRP) Approach," arXiv:2508.11856, 16 Aug 2025, Section 6.6 (Theoretical Aspects).