Off‑policy learning for continuous‑time mean–variance reinforcement learning
Develop and analyze off‑policy learning theory and algorithms for continuous‑time mean–variance reinforcement learning, including policy evaluation and policy gradient methods for settings in which the training data are generated by behavior policies that differ from the target policies to be executed.
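As a rough illustration of what off-policy policy evaluation could look like in this setting, the sketch below uses per-decision importance sampling to estimate a quadratic terminal-wealth objective E[(X_T − w)²] under a target Gaussian policy, using trajectories simulated under a different Gaussian behavior policy on an Euler discretization of the wealth dynamics. All function names, the parametric policy forms, and the market parameters are hypothetical choices made for this example; this is not the algorithm proposed in the cited paper.

```python
import numpy as np

# Hypothetical sketch (not the cited paper's method): per-decision importance
# sampling to evaluate a mean-variance-style objective E[(X_T - w)^2] under a
# target policy, from data generated by a different behavior policy.
# The wealth SDE dX = [rX + (mu - r)a] dt + sigma * a dW is Euler-discretized,
# with action a = dollar amount invested in the risky asset.

def gaussian_logpdf(a, mean, std):
    return -0.5 * np.log(2 * np.pi * std**2) - (a - mean) ** 2 / (2 * std**2)

def simulate_behavior_trajectories(n_paths, n_steps, dt, x0, mu, sigma, r,
                                   behavior_policy, rng):
    """Simulate wealth paths and actions under the behavior policy."""
    X = np.full(n_paths, x0, dtype=float)
    states = np.empty((n_paths, n_steps))
    actions = np.empty((n_paths, n_steps))
    for k in range(n_steps):
        states[:, k] = X
        mean_b, std_b = behavior_policy(X, k * dt)
        a = rng.normal(mean_b, std_b)                 # exploratory Gaussian action
        actions[:, k] = a
        dW = rng.normal(0.0, np.sqrt(dt), size=n_paths)
        X = X + (r * X + (mu - r) * a) * dt + sigma * a * dW  # Euler wealth step
    return states, actions, X

def off_policy_mv_value(states, actions, X_T, dt, w, behavior_policy, target_policy):
    """Self-normalized importance-sampling estimate of E_target[(X_T - w)^2]."""
    n_paths, n_steps = actions.shape
    log_rho = np.zeros(n_paths)
    for k in range(n_steps):
        mean_b, std_b = behavior_policy(states[:, k], k * dt)
        mean_t, std_t = target_policy(states[:, k], k * dt)
        log_rho += (gaussian_logpdf(actions[:, k], mean_t, std_t)
                    - gaussian_logpdf(actions[:, k], mean_b, std_b))
    rho = np.exp(log_rho)
    rho /= rho.mean()        # self-normalize; note weights can be high-variance
    return np.mean(rho * (X_T - w) ** 2)

# Usage sketch with hypothetical constant-mean Gaussian policies.
rng = np.random.default_rng(0)
behavior = lambda x, t: (0.5 * np.ones_like(x), 0.3)   # behavior policy (data-generating)
target   = lambda x, t: (0.6 * np.ones_like(x), 0.25)  # target policy to evaluate
S, A, X_T = simulate_behavior_trajectories(
    n_paths=20000, n_steps=50, dt=0.02, x0=1.0,
    mu=0.08, sigma=0.2, r=0.02, behavior_policy=behavior, rng=rng)
print(off_policy_mv_value(S, A, X_T, dt=0.02, w=1.5,
                          behavior_policy=behavior, target_policy=target))
```

Over long horizons the per-step likelihood ratios compound, so the variance of such estimators can blow up; controlling this (and extending the idea to the genuinely continuous-time, entropy-regularized setting with policy-gradient updates) is part of what the open problem asks for.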
References
In the MV setting, important open questions include performance guarantees of modified online algorithms, improvement of regret bound, off-policy learning, and large investors whose actions impact the asset prices (so counterfactuals become unobservable by mere “paper portfolios”).
                — Mean–Variance Portfolio Selection by Continuous-Time Reinforcement Learning: Algorithms, Regret Analysis, and Empirical Study (arXiv:2412.16175, Huang et al., 8 Dec 2024), Section 6 (Conclusions)