Physics-Informed Policy Iteration (PINN-PI)

Updated 3 January 2026
  • Physics-Informed Policy Iteration (PINN-PI) is a mesh-free framework that integrates physics-informed neural networks with dynamic programming to solve nonlinear optimal and stochastic control PDEs.
  • It alternates between neural network-based policy evaluation and improvement using automatic differentiation and a KL-divergence loss, ensuring convergence with rigorous $L^2$ error bounds.
  • PINN-PI demonstrates scalability and efficiency for high-dimensional systems, extending to robust control, stochastic differential games, and PDE-constrained applications.

Physics-Informed Policy Iteration (PINN-PI) is a mesh-free algorithmic framework that leverages physics-informed neural networks (PINNs) to solve nonlinear optimal control, stochastic control, and differential game problems posed as partial differential equations (PDEs) of Hamilton–Jacobi–Bellman (HJB) or Hamilton–Jacobi–Isaacs (HJI) type. PINN-PI systematically alternates between neural network-based policy evaluation and improvement, integrating automatic differentiation with classical dynamic programming constructs. The approach provides rigorous $L^2$ error bounds, ensures convergence, and demonstrates scalability for high-dimensional problems that are intractable for grid-based solvers.

1. Mathematical Formulation and Policy Iteration Paradigm

PINN-PI is grounded in the controlled diffusion framework and entropy-regularized control. For $d$-dimensional stochastic systems governed by

$dX_t = \left[\int_U b(X_t,u)\,\pi(X_t,du)\right] dt + \sigma(X_t)\, dW_t$

the entropy-regularized cost for a relaxed policy $\pi$ is

$V^\pi(x) = \mathbb{E}\left[\int_0^\infty e^{-\rho t} \left(\int_U r(X_t,u)\,\pi(X_t,du) - \lambda\,\mathrm{KL}\left(\pi(X_t,\cdot)\,\Vert\,\rho_0\right)\right) dt\right]$

where $\rho_0$ is a reference measure and $\lambda$ controls the regularization strength (Kim et al., 3 Aug 2025). The optimal value $V(x) = \sup_\pi V^\pi(x)$ solves the nonlinear elliptic entropy-regularized HJB equation,

$\rho V(x) = \sup_{\pi\in\mathcal{P}(U)} \left\{\tfrac{1}{2}\mathrm{tr}\left(\Sigma(x) D^2 V(x)\right) + \int_U \left[b(x,u)\cdot\nabla V(x)+r(x,u)\right]\pi(du) - \lambda\int_U \ln\pi(u)\,\pi(du)\right\}$

where $\mathcal{P}(U)$ denotes the probability measures on $U$ and $\Sigma(x) = \sigma(x)\sigma(x)^\top$.

Classical policy iteration alternates:

  • Policy evaluation: Solve the linear PDE for $V^\pi$ given a fixed $\pi$.
  • Policy improvement: Derive the updated policy from the value gradients, via softmax or pointwise minimization; the explicit softmax form for the entropy-regularized case is given below.
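
For the entropy-regularized HJB above, the improvement step admits a closed form: the maximizing relaxed policy is a Gibbs (softmax) distribution that reweights the reference measure $\rho_0$ by the exponentiated integrand of the Hamiltonian. A sketch of this standard identity, with notation as in the cost functional above and $Z(x)$ the normalizing constant:

$\pi^{+}(du \mid x) = \frac{1}{Z(x)}\exp\!\left(\frac{b(x,u)\cdot\nabla V(x) + r(x,u)}{\lambda}\right)\rho_0(du), \qquad Z(x) = \int_U \exp\!\left(\frac{b(x,u)\cdot\nabla V(x) + r(x,u)}{\lambda}\right)\rho_0(du)$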

For general stochastic control or differential games, PINN-PI applies to HJB/HJI equations of the form

$\partial_t V + H(x, \nabla V, D^2 V) = 0$

with the Hamiltonian $H$ encoding supremum/infimum (or minimax) operators over controls and disturbances (Yang et al., 21 Jul 2025, Wang et al., 26 Aug 2025). In robust control, Zubov-type PDEs are handled using similar max Hamiltonian formulations.
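
As an illustrative instance (with assumed generic dynamics $f(x,u,d)$, running cost $\ell$, and diffusion $\Sigma$, not tied to any one cited paper), a stochastic Isaacs-type Hamiltonian reads

$H(x, \nabla V, D^2 V) = \min_{u\in U}\,\max_{d\in D}\left\{ f(x,u,d)\cdot\nabla V(x) + \ell(x,u,d) \right\} + \tfrac{1}{2}\mathrm{tr}\left(\Sigma(x)\, D^2 V(x)\right)$

with the order of the min and max (or their replacement by a single sup) determined by the information structure of the problem.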

2. PINN-Based Policy Evaluation and Neural Architecture

PINN-PI approximates both the value function $V(x)$ and the policy $\pi(x,u)$ (when needed) with neural networks. Key implementation aspects:

  • Value networks $v(x;\theta)$ and policy networks $\pi(x,u;\omega)$, parameterized by $\theta$ and $\omega$.
  • Policy evaluation: Minimize the mean-squared PDE residual,

$\mathcal{L}_{\text{res}}(\theta) = \frac{1}{N}\sum_{i=1}^N \left|\mathcal{R}(x_i;\theta,\omega_n)\right|^2$

with $\mathcal{R}(x_i;\cdot)$ encoding the local HJB/HJI residual (Kim et al., 3 Aug 2025, Meng et al., 2024); a minimal autodiff sketch of this loss follows the list below.

  • Policy improvement: Fit the network $\pi(x,u;\omega)$ to the analytic target (softmax or argmin) using a KL-divergence loss or supervised anchor points (Kim et al., 3 Aug 2025, Wang et al., 26 Aug 2025).
  • Collocation points $\{x_i\}$ sampled i.i.d. in the state space and $\{u_j\}$ in the action space, supporting mesh-free approximation. Networks typically have 4–6 layers with 50–128 units per layer and $\tanh$ or ReLU activations (Kim et al., 3 Aug 2025, Mukherjee et al., 2023).
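
As a concrete illustration of the evaluation step, the sketch below computes a mean-squared HJB residual with automatic differentiation in PyTorch. It assumes a simplified deterministic-policy HJB of the form $\rho v = b(x,u(x))\cdot\nabla v + r(x,u(x)) + \tfrac{1}{2}\sigma^2 \Delta v$ with constant scalar diffusion; all names (value_net, policy_net, drift, reward, sigma, rho) are illustrative placeholders, not the interfaces of the cited implementations.

```python
import torch

def hjb_residual(value_net, policy_net, x, drift, reward, sigma, rho):
    """Pointwise HJB residual at collocation points x of shape [N, d].

    Simplifying assumptions: deterministic feedback policy u = policy_net(x),
    constant scalar diffusion sigma, discount rate rho; drift(x, u) -> [N, d]
    and reward(x, u) -> [N, 1] are placeholder callables.
    """
    x = x.requires_grad_(True)
    v = value_net(x)                                                  # [N, 1]
    grad_v = torch.autograd.grad(v.sum(), x, create_graph=True)[0]    # [N, d]

    # Laplacian via a second autodiff pass (sum of pure second derivatives).
    lap_v = torch.zeros_like(v)
    for i in range(x.shape[1]):
        g2 = torch.autograd.grad(grad_v[:, i].sum(), x, create_graph=True)[0]
        lap_v = lap_v + g2[:, i:i + 1]

    u = policy_net(x)
    hamiltonian = (drift(x, u) * grad_v).sum(dim=1, keepdim=True) + reward(x, u)
    return rho * v - hamiltonian - 0.5 * sigma ** 2 * lap_v

def evaluation_loss(value_net, policy_net, x, drift, reward, sigma, rho):
    # Mean-squared PDE residual over the collocation batch (the loss L_res).
    res = hjb_residual(value_net, policy_net, x, drift, reward, sigma, rho)
    return (res ** 2).mean()
```

The improvement step is handled analogously: the policy network is regressed onto its analytic target (e.g., the softmax form of Section 1) with a KL or cross-entropy objective evaluated on the same collocation batch.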

3. Mesh-Free Implementation and Training Procedures

  • Mesh-free spatial sampling: $N = 10^3$ to $2\times 10^4$ collocation points per iteration, with batch size $10^3$.
  • Automatic differentiation computes gradients and Hessians required for PDE residuals.
  • Optimization uses Adam with learning rates $10^{-4}$–$10^{-3}$.
  • Alternation: each PI cycle iteratively refines $\theta$ and $\omega$ until the $L^2$ norm of the value update, $\|v(\cdot;\theta_{n+1}) - v(\cdot;\theta_n)\|_{L^2}$, falls below a tolerance (Kim et al., 3 Aug 2025, Kim et al., 3 Aug 2025); a schematic of this outer loop is sketched below.
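
Putting the pieces together, here is a schematic of the outer PINN-PI loop described by these bullets, using the hyperparameter ranges quoted above. The objectives eval_loss and improvement_loss are assumed closures over the problem data (e.g., wrapping the residual loss sketched in Section 2), and sample_collocation is an assumed mesh-free sampler; none of these names come from the cited code.

```python
import copy
import torch

def pinn_pi(value_net, policy_net, eval_loss, improvement_loss,
            sample_collocation, n_outer=50, n_inner=2000,
            batch_size=1000, lr=1e-3, tol=1e-4):
    """Schematic PINN-PI loop: alternate PDE-residual policy evaluation and
    policy improvement until the Monte Carlo L2 change of the value stalls."""
    for _ in range(n_outer):
        v_prev = copy.deepcopy(value_net)   # frozen copy for the stopping test

        # Policy evaluation: minimize the mean-squared HJB/HJI residual.
        opt_v = torch.optim.Adam(value_net.parameters(), lr=lr)
        for _ in range(n_inner):
            x = sample_collocation(batch_size)
            loss = eval_loss(value_net, policy_net, x)
            opt_v.zero_grad()
            loss.backward()
            opt_v.step()

        # Policy improvement: fit the policy to its analytic (softmax/argmin) target.
        opt_p = torch.optim.Adam(policy_net.parameters(), lr=lr)
        for _ in range(n_inner):
            x = sample_collocation(batch_size)
            loss = improvement_loss(value_net, policy_net, x)
            opt_p.zero_grad()
            loss.backward()
            opt_p.step()

        # Stopping criterion: Monte Carlo estimate of ||v_{n+1} - v_n||_{L2}.
        x = sample_collocation(10 * batch_size)
        with torch.no_grad():
            change = torch.sqrt(((value_net(x) - v_prev(x)) ** 2).mean())
        if change.item() < tol:
            break
    return value_net, policy_net
```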

For high-dimensional systems, direct mesh-based solvers exhibit exponential growth in complexity; PINN-PI maintains tractability across dimensions up to $d=10$ (Meng et al., 2024, Yang et al., 21 Jul 2025).

4. Error Decomposition and Theoretical Guarantees

PINN-PI admits rigorous error bounds via a three-term decomposition:

$\|\tilde v^n - V\|_{L^2(\mathcal{X})} \;\leq\; \underbrace{\|\tilde v^n - \hat v^n\|}_{\text{PDE residual error}} + \underbrace{\|\hat v^n - v^n\|}_{\text{policy network error}} + \underbrace{\|v^n - V\|}_{\text{iteration error}}$

Assuming uniform ellipticity, drift bounds, and Lipschitz continuity of the policy update, the total error remains $O(r+q)$ (policy network error plus PINN residual) with exponential convergence in the PI index [(Kim et al., 3 Aug 2025), Thm. 4.1; (Kim et al., 3 Aug 2025); (Yang et al., 21 Jul 2025), Thm. 3.3]. Lipschitz bounds ensure that errors in value gradients propagate linearly to policy updates.

Variants such as ELM-PI apply linear least-squares when the system dimension is small, while PINN-PI (deep networks) generalizes to high-dimensional scenarios (Meng et al., 2024).
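
In the low-dimensional regime, the ELM-PI variant replaces gradient descent by a single linear solve: with random, frozen hidden-layer features, the value approximation $v(x)=\sum_k w_k\varphi_k(x)$ makes the linear policy-evaluation PDE residual linear in the output weights $w$, so each evaluation step is a least-squares problem over the collocation points. A minimal NumPy sketch under that assumption (the feature-operator callable apply_pde_operator and source term rhs are illustrative placeholders):

```python
import numpy as np

def elm_policy_evaluation(apply_pde_operator, rhs, collocation_points, n_features):
    """One ELM-PI evaluation step posed as a linear least-squares solve.

    apply_pde_operator(k, X) is assumed to return (L phi_k)(X) for the linear
    evaluation operator L applied to the k-th random frozen feature (its
    derivatives can be written in closed form for tanh features); rhs(X) is
    the source term of the evaluation PDE.
    """
    X = collocation_points                       # collocation points, shape [N, d]
    A = np.column_stack([apply_pde_operator(k, X) for k in range(n_features)])  # [N, K]
    b = rhs(X)                                   # right-hand side, shape [N]
    w, *_ = np.linalg.lstsq(A, b, rcond=None)    # least-squares output weights
    return w
```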

5. Extensions to Robust Control and Differential Games

PINN-PI extends to:

  • Robust region of attraction (RROA) via the generalized Zubov PDE: policy iteration alternates a linear PDE solve under a fixed disturbance with pointwise maximization to find the worst-case disturbance. Rollout-based anchor points stabilize training and prevent singular flat solutions (Wang et al., 26 Aug 2025).
  • Stochastic differential games: policy iteration solves HJI equations via minimax optimization at the policy improvement step. The value function is approximated mesh-free, and controls/disturbances are updated via pointwise saddle-point problems leveraging automatic differentiation (Yang et al., 21 Jul 2025); a schematic of this update follows the list. Equi-Lipschitz properties ensure convergence without convexity.
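
At the improvement step in the game setting just described, the update can be viewed as a pointwise saddle-point problem solved at each collocation point (an illustrative schematic, with the same assumed dynamics $f$ and running cost $\ell$ as above):

$\bigl(u_{n+1}(x),\, d_{n+1}(x)\bigr) \in \arg\min_{u\in U}\,\max_{d\in D}\left\{ f(x,u,d)\cdot\nabla v_n(x) + \ell(x,u,d) \right\}$

with $\nabla v_n$ obtained by automatic differentiation of the current value network.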

6. Numerical Benchmarks and Empirical Performance

PINN-PI has demonstrated empirical scalability, accuracy, and robustness:

  • High-dimensional LQR ($d=5,10$): PINN-PI achieves smooth convergence to near-optimal values, maintaining $L^2$ errors below $10^{-2}$. SAC (Soft Actor-Critic) requires orders of magnitude more data and plateaus for $d=10$ (Kim et al., 3 Aug 2025, Kim et al., 3 Aug 2025).
  • Nonlinear benchmarks (pendulum, cartpole): PINN-PI stabilizes in fewer iterations, handles strong noise, and achieves higher rewards than model-free RL algorithms (Kim et al., 3 Aug 2025, Kim et al., 3 Aug 2025).
  • Fluid-cooled battery packs (1D coupled PDE): a hybrid PINN-PI actor-critic framework attains a fourfold improvement in sample efficiency compared to PPO, exploiting physical structure in the value network (Mukherjee et al., 2023).
  • Robust region of attraction (2D Van der Pol, 10D decoupled): rollout-enhanced PINN-PI achieves contour-accurate RROA estimates, avoiding singularities and scaling to $d=10$ (Wang et al., 26 Aug 2025).
  • Differential games (publisher-subscriber, path planning): PINN-PI outperforms direct PINN solvers in accuracy and smoothness, maintaining relative $L^2$ errors $<10^{-2}$ in $d=5,10$ (Yang et al., 21 Jul 2025).

Sample performance metrics:

| Benchmark | Dimensionality | PINN-PI accuracy | Reference method | Data efficiency |
|---|---|---|---|---|
| LQR | 5, 10 | $L^2$ error $<10^{-2}$ | SAC | Orders of magnitude less data |
| Cartpole | 2 | Higher reward | SAC | Faster convergence |
| Zubov (decoupled) | 10 | Accurate contours | FDM | Matched |
| Publisher-Subscriber | 5, 10 | $L^2$ errors $10^{-2}$, $5\times10^{-2}$ | Direct PINN | Superior |

PINN-PI maintains polynomial runtime scaling in practice, with wall-clock times comparable to or better than SAC and dramatically better than those of grid-based solvers as $d$ increases (Kim et al., 3 Aug 2025, Meng et al., 2024).

7. Rigorous Verification, Stability, and Application Domains

Stability certification via formal verification (SMT solvers such as dReal) permits Lyapunov-region guarantees for the learned controllers, especially in nonlinear settings (Meng et al., 2024).
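
Concretely, the certified property is of Lyapunov type: for a learned certificate $v_\theta$ and controller $u_\omega$ with closed-loop dynamics $\dot x = f(x, u_\omega(x))$, the SMT solver is asked to verify conditions of the form (an illustrative formulation, not necessarily the exact encoding used in the cited work)

$v_\theta(0) = 0, \qquad v_\theta(x) > 0 \quad\text{and}\quad \nabla v_\theta(x)\cdot f\bigl(x, u_\omega(x)\bigr) < 0 \quad \text{for all } x \in \mathcal{D}\setminus\{0\}$

over a candidate region $\mathcal{D}$, yielding a verified Lyapunov region for the learned controller.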

PINN-PI is applicable across domains, including stochastic optimal control, robust stabilization and region-of-attraction estimation, stochastic differential games, and PDE-constrained control such as thermal management of fluid-cooled battery packs.

A plausible implication is that directly embedding physical PDE structure and dynamic programming into neural policy iteration yields scalable, verifiable control solutions that remain robust as problem dimensionality and regularization requirements grow.

References

  • "Physics-informed approach for exploratory Hamilton--Jacobi--Bellman equations via policy iterations" (Kim et al., 3 Aug 2025)
  • "Neural Policy Iteration for Stochastic Optimal Control: A Physics-Informed Approach" (Kim et al., 3 Aug 2025)
  • "Solving nonconvex Hamilton--Jacobi--Isaacs equations with PINN-based policy iteration" (Yang et al., 21 Jul 2025)
  • "Learning Robust Regions of Attraction Using Rollout-Enhanced Physics-Informed Neural Networks with Policy Iteration" (Wang et al., 26 Aug 2025)
  • "Physics-Informed Neural Network Policy Iteration: Algorithms, Convergence, and Verification" (Meng et al., 2024)
  • "Actor-Critic Methods using Physics-Informed Neural Networks: Control of a 1D PDE Model for Fluid-Cooled Battery Packs" (Mukherjee et al., 2023)
