Convergence theory for PG-DPO and costate estimation in the Two-Stage variant

Investigate the theoretical convergence properties of the Pontryagin-Guided Direct Policy Optimization (PG-DPO) algorithm used for continuous-time multi-asset portfolio optimization, with a specific focus on characterizing the convergence behavior of the costate estimates produced via backpropagation-through-time in the Two-Stage PG-DPO variant.

Background

The paper introduces Pontryagin-Guided Direct Policy Optimization (PG-DPO), a deep learning framework that leverages Pontryagin’s Maximum Principle to solve high-dimensional continuous-time portfolio optimization problems. PG-DPO uses neural network policies and backpropagation-through-time (BPTT) to estimate costate processes, avoiding the curse of dimensionality inherent in dynamic programming approaches.
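
To make the costate-estimation mechanism concrete, the sketch below is a minimal illustration, not the authors' implementation: it assumes a single risky asset with geometric Brownian motion dynamics, CRRA terminal utility, and arbitrary parameter values and network sizes. Wealth is simulated forward under a neural policy, and backpropagating the realized utility through the simulated dynamics yields a pathwise estimate of the costate, i.e. the sensitivity of the objective to an intermediate wealth state.

```python
# Minimal sketch (assumed model, not the paper's code) of BPTT-based costate estimation.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative single-asset Black-Scholes market and CRRA risk aversion.
mu, r, sigma, gamma = 0.08, 0.02, 0.20, 3.0
T, N, n_paths = 1.0, 50, 256
dt = T / N

# Neural policy mapping (t, X_t) to the risky-asset weight pi_t.
policy = nn.Sequential(nn.Linear(2, 32), nn.Tanh(), nn.Linear(32, 1))

X = torch.ones(n_paths, 1, requires_grad=True)   # initial wealth X_0 = 1
states = [X]
for n in range(N):
    t = torch.full((n_paths, 1), n * dt)
    pi = policy(torch.cat([t, X], dim=1))
    dW = torch.randn(n_paths, 1) * dt ** 0.5
    # Euler-Maruyama step for dX = X[(r + pi(mu - r)) dt + pi sigma dW]
    # (the crude scheme assumes wealth stays positive, which holds here
    # because the untrained policy keeps pi small).
    X = X * (1 + (r + pi * (mu - r)) * dt + pi * sigma * dW)
    states.append(X)

# Objective: Monte Carlo estimate of expected CRRA utility of terminal wealth.
J = (states[-1].pow(1 - gamma) / (1 - gamma)).mean()

# Pathwise costate estimate at an intermediate time t_k: lambda_{t_k} ~ dJ/dX_{t_k},
# obtained by backpropagation-through-time; rescale the mean to per-path values.
k = N // 2
lam = torch.autograd.grad(J, states[k], retain_graph=True)[0] * n_paths

# The same computational graph also supplies the policy gradient used to train
# the networks, so costate estimates come out of the training loop directly.
policy_grads = torch.autograd.grad(J, list(policy.parameters()))
```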

A key contribution is the Two-Stage PG-DPO variant, which first warms up the networks to obtain costate estimates and then plugs these estimates into the Pontryagin first-order conditions to derive near-optimal controls analytically. Although extensive numerical experiments demonstrate fast convergence and high accuracy in practice, the paper does not provide a rigorous convergence analysis of the PG-DPO algorithm.
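
As an illustration of the second stage, the sketch below assumes the Pontryagin first-order condition takes the familiar Merton-type form for wealth dynamics that are linear in wealth, with the optimal weights determined by the costate and its wealth derivative. The market parameters and the helper pontryagin_control are hypothetical stand-ins; in the algorithm itself, the costate and its derivative would come from the warm-up BPTT estimates rather than from the closed-form CRRA values used here as a sanity check.

```python
# Minimal sketch (assumed form of the first-order condition, not the paper's code)
# of plugging costate estimates into a Merton-type optimality condition.
import numpy as np

mu = np.array([0.08, 0.06])          # illustrative risky-asset drifts
r = 0.02                             # illustrative risk-free rate
Sigma = np.array([[0.0400, 0.0100],  # illustrative return covariance matrix
                  [0.0100, 0.0225]])

def pontryagin_control(x, lam, dlam_dx):
    """Evaluate pi* = -(lam / (x * dlam/dx)) * Sigma^{-1} (mu - r 1),
    given wealth x, a costate estimate lam, and its wealth derivative."""
    risk_tolerance = -lam / (x * dlam_dx)
    return risk_tolerance * np.linalg.solve(Sigma, mu - r)

# Sanity check with CRRA utility U(x) = x^{1-gamma}/(1-gamma): there
# lam = x^{-gamma} and dlam/dx = -gamma x^{-gamma-1}, so the formula
# recovers the Merton weights (1/gamma) Sigma^{-1}(mu - r 1).
gamma, x0 = 3.0, 1.0
lam = x0 ** (-gamma)
dlam_dx = -gamma * x0 ** (-gamma - 1.0)
print(pontryagin_control(x0, lam, dlam_dx))
```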

The authors note that derivative stability results are available for certain structured affine models, but similar guarantees for broader, non-affine settings are not established. As such, a formal convergence theory for PG-DPO—particularly for the costate estimation mechanism central to the Two-Stage PG-DPO variant—remains unresolved and is identified as an explicit open question.
