Large‑investor mean–variance RL with price impact and unobservable counterfactuals
Develop continuous-time reinforcement learning frameworks, with theoretical guarantees, for mean–variance portfolio selection by large investors whose trades move asset prices and factors. The central difficulty is that price impact is endogenous: the observed price path depends on the portfolio actually traded, so the wealth trajectory an alternative portfolio would have produced cannot be inferred from observed prices.
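The obstacle can be illustrated with a toy simulation. The sketch below is not the model of Huang et al. It assumes a hypothetical discrete-time price whose drift is shifted linearly by the traded risky fraction (`kappa * u`), a zero risk-free rate, and constant-fraction policies. All names and parameter values (`simulate`, `paper_wealth`, `counterfactual_gap`, `kappa`, `u_obs`, `u_alt`) are illustrative assumptions. It compares the true wealth of an alternative policy with the "paper portfolio" obtained by replaying that policy on prices generated under the observed policy, using the same random shocks.

```python
import math
import random


def simulate(u, kappa, shocks, mu=0.05, sigma=0.2, dt=1 / 252, s0=100.0, x0=1.0):
    """Trade a constant risky fraction u in a toy market.

    Assumption: price impact enters as a linear drift shift kappa * u.
    Returns the price path and the wealth path actually realized.
    """
    prices, wealth = [s0], [x0]
    for z in shocks:
        ret = (mu + kappa * u) * dt + sigma * math.sqrt(dt) * z
        prices.append(prices[-1] * (1.0 + ret))
        wealth.append(wealth[-1] * (1.0 + u * ret))
    return prices, wealth


def paper_wealth(u_alt, prices, x0=1.0):
    """Replay an alternative fraction u_alt on an already observed price path."""
    wealth = [x0]
    for prev, cur in zip(prices, prices[1:]):
        wealth.append(wealth[-1] * (1.0 + u_alt * (cur / prev - 1.0)))
    return wealth


def counterfactual_gap(kappa, u_obs=0.5, u_alt=1.5, n_steps=252, seed=0):
    """Absolute terminal-wealth gap between the paper portfolio and the truth."""
    rng = random.Random(seed)
    shocks = [rng.gauss(0.0, 1.0) for _ in range(n_steps)]
    prices_obs, _ = simulate(u_obs, kappa, shocks)   # what the investor observes
    _, true_alt = simulate(u_alt, kappa, shocks)     # what u_alt would really yield
    paper_alt = paper_wealth(u_alt, prices_obs)      # naive replay on observed prices
    return abs(paper_alt[-1] - true_alt[-1])


if __name__ == "__main__":
    print("gap without impact:", counterfactual_gap(0.0))
    print("gap with impact:   ", counterfactual_gap(0.5))
```

Under these assumptions the gap is zero up to floating-point error when `kappa = 0` (a small investor, for whom paper portfolios are valid) and clearly nonzero when `kappa > 0`, because the alternative policy would have generated a different price path from the one observed.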
References
In the MV setting, important open questions include performance guarantees of modified online algorithms, improvement of regret bound, off-policy learning, and large investors whose actions impact the asset prices (so counterfactuals become unobservable by mere “paper portfolios”).
Source: Mean–Variance Portfolio Selection by Continuous-Time Reinforcement Learning: Algorithms, Regret Analysis, and Empirical Study (2412.16175, Huang et al., 8 Dec 2024), Section 6 (Conclusions)