Off‑policy learning for continuous‑time mean–variance reinforcement learning
Develop and analyze off‑policy learning theory and algorithms for continuous‑time mean–variance reinforcement learning, including policy evaluation and policy gradient methods for settings in which the training data are generated by behavior policies that differ from the target policies to be executed.
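As a rough illustration of what off-policy policy evaluation could look like in this setting, the sketch below uses per-decision importance sampling to estimate a quadratic terminal-wealth objective E[(X_T − w)²] under a target Gaussian policy, using trajectories simulated under a different Gaussian behavior policy on an Euler discretization of the wealth dynamics. All function names, the parametric policy forms, and the market parameters are hypothetical choices made for this example; this is not the algorithm proposed in the cited paper.

```python
import numpy as np

# Hypothetical sketch (not the cited paper's method): per-decision importance
# sampling to evaluate a mean-variance-style objective E[(X_T - w)^2] under a
# target policy, from data generated by a different behavior policy.
# The wealth SDE dX = [rX + (mu - r)a] dt + sigma * a dW is Euler-discretized,
# with action a = dollar amount invested in the risky asset.

def gaussian_logpdf(a, mean, std):
    return -0.5 * np.log(2 * np.pi * std**2) - (a - mean) ** 2 / (2 * std**2)

def simulate_behavior_trajectories(n_paths, n_steps, dt, x0, mu, sigma, r,
                                   behavior_policy, rng):
    """Simulate wealth paths and actions under the behavior policy."""
    X = np.full(n_paths, x0, dtype=float)
    states = np.empty((n_paths, n_steps))
    actions = np.empty((n_paths, n_steps))
    for k in range(n_steps):
        states[:, k] = X
        mean_b, std_b = behavior_policy(X, k * dt)
        a = rng.normal(mean_b, std_b)                 # exploratory Gaussian action
        actions[:, k] = a
        dW = rng.normal(0.0, np.sqrt(dt), size=n_paths)
        X = X + (r * X + (mu - r) * a) * dt + sigma * a * dW  # Euler wealth step
    return states, actions, X

def off_policy_mv_value(states, actions, X_T, dt, w, behavior_policy, target_policy):
    """Self-normalized importance-sampling estimate of E_target[(X_T - w)^2]."""
    n_paths, n_steps = actions.shape
    log_rho = np.zeros(n_paths)
    for k in range(n_steps):
        mean_b, std_b = behavior_policy(states[:, k], k * dt)
        mean_t, std_t = target_policy(states[:, k], k * dt)
        log_rho += (gaussian_logpdf(actions[:, k], mean_t, std_t)
                    - gaussian_logpdf(actions[:, k], mean_b, std_b))
    rho = np.exp(log_rho)
    rho /= rho.mean()        # self-normalize; note weights can be high-variance
    return np.mean(rho * (X_T - w) ** 2)

# Usage sketch with hypothetical constant-mean Gaussian policies.
rng = np.random.default_rng(0)
behavior = lambda x, t: (0.5 * np.ones_like(x), 0.3)   # behavior policy (data-generating)
target   = lambda x, t: (0.6 * np.ones_like(x), 0.25)  # target policy to evaluate
S, A, X_T = simulate_behavior_trajectories(
    n_paths=20000, n_steps=50, dt=0.02, x0=1.0,
    mu=0.08, sigma=0.2, r=0.02, behavior_policy=behavior, rng=rng)
print(off_policy_mv_value(S, A, X_T, dt=0.02, w=1.5,
                          behavior_policy=behavior, target_policy=target))
```

Over long horizons the per-step likelihood ratios compound, so the variance of such estimators can blow up; controlling this (and extending the idea to the genuinely continuous-time, entropy-regularized setting with policy-gradient updates) is part of what the open problem asks for.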
References
In the MV setting, important open questions include performance guarantees of modified online algorithms, improvement of regret bound, off-policy learning, and large investors whose actions impact the asset prices (so counterfactuals become unobservable by mere “paper portfolios”).
                — Mean–Variance Portfolio Selection by Continuous-Time Reinforcement Learning: Algorithms, Regret Analysis, and Empirical Study (arXiv:2412.16175, Huang et al., 8 Dec 2024), Section 6 (Conclusions)