Continuous-Time Policy Improvement Algorithm

Updated 24 September 2025
  • The Policy Improvement Algorithm is a method in stochastic control that iteratively refines Markov policies to achieve non-decreasing value functions and converge to an optimal policy.
  • It uses an improvement step based on maximizing the sum of the infinitesimal generator and running reward, paralleling the Hamilton–Jacobi–Bellman framework.
  • The approach leverages a weak formulation to manage diffusion control problems, integrating PDE methods and martingale techniques for convergence analysis.

A Policy Improvement Algorithm (PIA) is a class of methods in stochastic control and reinforcement learning focused on constructing a sequence of policies, each guaranteed to achieve at least as high a value as the previous, ultimately converging to an optimal policy. In continuous-time stochastic control, the PIA proceeds by iteratively updating Markov controls according to a local optimality condition derived from the structure of the controlled process’s infinitesimal generator, and it ensures monotone convergence even in settings where controls take values in uncountable sets and uniqueness of the probability law is subtle. The weak formulation of stochastic control is central in this framework, as it accommodates situations where policy-induced stochastic differential equations (SDEs) lack strong solutions. This approach connects policy iteration algorithms to the analysis of partial differential equations (PDEs) and martingale problems.

1. Generalization of PIA to Continuous-Time Control

The classical PIA, prevalent in discrete-time and countable state-action frameworks, is extended to a broad continuous-time paradigm in which the action set is a possibly uncountable compact metric space. The main goal is to generate a sequence of Markov policies $(T_n)_{n \in \mathbb{N}}$ such that the associated payoff sequence $(V^{(T_n)})$ is non-decreasing:

$$V^{(T_{n+1})}(x) \geq V^{(T_n)}(x) \quad \forall x \in S,$$

and converges pointwise to the value function $V$ of the control problem. This is achieved without resorting to time discretization.

2. Continuous-Time Setting and the Role of Weak Formulation

In continuous time, the strong formulation—where a fixed probability space and a strictly specified filtration are required—becomes inadequate for many natural control policies. The weak formulation allows the controlled process to be realized on potentially different probability spaces for different controls, with policies defined as mappings $T: S \to A$. For instance, the policy $T(x) = \mathrm{sgn}(x)$ for the SDE $dX_t = a\, dV_t$ leads to a law equivalent to Brownian motion but does not admit a strong solution, necessitating the weak approach. This flexibility is critical for including Markov policies for which strong solutions fail to exist but which still yield controlled processes that are well defined in distribution.

This setting departs from classical fixed-filtration pathwise uniqueness assumptions, and instead develops convergence theory for control by considering admissible processes and controls as classes defined up to equivalence in law.
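
As a concrete illustration of why the weak formulation matters, the following sketch (hypothetical Python, not from the paper) simulates an Euler–Maruyama discretisation of the SDE induced by the sign policy above and checks that the empirical marginal law at the terminal time matches that of a standard Brownian motion; the policy induces a perfectly well-defined law in distribution even though no strong solution exists. The convention $\mathrm{sgn}(0) = 1$ is an assumption made only to avoid a degenerate discretisation at the origin.

```python
import numpy as np

# Hypothetical illustration (not from the paper): Euler-Maruyama discretisation
# of the SDE induced by the policy T(x) = sgn(x).  Pathwise the scheme has no
# strong-solution counterpart, but the simulated marginal law at the terminal
# time matches that of a standard Brownian motion, which is exactly the object
# the weak formulation works with.  The convention sgn(0) = 1 is assumed here
# only to avoid absorption at zero in the discretisation.

rng = np.random.default_rng(0)
n_paths, n_steps, t_final = 100_000, 1_000, 1.0
dt = t_final / n_steps


def sgn(x):
    return np.where(x >= 0.0, 1.0, -1.0)


X = np.zeros(n_paths)
for _ in range(n_steps):
    dB = rng.normal(scale=np.sqrt(dt), size=n_paths)  # Brownian increments
    X += sgn(X) * dB

# Empirical moments of the terminal state versus the N(0, t_final) target.
print("mean:", X.mean(), "(target 0)")
print("variance:", X.var(), f"(target {t_final})")
```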

3. Algorithmic Structure and Improvement Step

An improvable Markov policy $T$ is defined by the regularity of its payoff $V^{(T)}$ (belonging to a function class $\mathcal{C}$). The policy improvement step is defined by

$$T'(x) \in \arg\max_{a \in A} \left\{ \mathcal{L}^a V^{(T)}(x) + f(x,a) \right\},$$

where $\mathcal{L}^a$ is the infinitesimal generator associated with the controlled process at action $a$:
$$\mathcal{L}^a h(x) = \tfrac{1}{2} \operatorname{Tr}\!\left(\sigma(x,a)\sigma(x,a)^\top Hh(x)\right) + \mu(x,a)^\top \nabla h(x) - \alpha(x,a)\, h(x),$$
with $Hh$ the Hessian of $h$ and $f(x,a)$ the running reward. This step is structurally parallel to dynamic programming via the Hamilton–Jacobi–Bellman (HJB) equation. Under the main structural assumptions (As1–As8), which ensure attainability of the supremum, regularity, and precompactness in $C(S,A)$, the following hold (a schematic implementation of the improvement step is sketched after the list):

  • Monotonicity: each policy improvement yields $V^{(T_{n+1})}(x) \geq V^{(T_n)}(x)$.
  • Convergence: $(V^{(T_n)})$ converges pointwise to the value function $V$.
  • Existence of a subsequence of policies converging (uniformly on compacts) to an optimal Markov policy $T^*$ with $V^{(T^*)}(x) = V(x)$.
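
To fix ideas, the following hypothetical Python sketch implements the improvement step for a one-dimensional diffusion: it maximises $\mathcal{L}^a V^{(T)}(x) + f(x,a)$ pointwise over a finite grid of candidate actions (a stand-in for the compact set $A$), with the derivatives of the current payoff approximated by finite differences. The function name `improve_policy` and its interface are invented for this illustration; it is a sketch under these assumptions, not the paper's implementation.

```python
import numpy as np

def improve_policy(x_grid, V, actions, mu, sigma, alpha, f):
    """One PIA improvement step on a 1-D state grid (illustrative sketch only).

    x_grid  : increasing, equally spaced array of states
    V       : current payoff V^{(T)} evaluated on x_grid
    actions : finite set of candidate actions approximating the compact set A
    mu, sigma, alpha, f : callables (x, a) -> float, the model coefficients
    Returns the improved policy T'(x) on the grid.
    """
    dx = x_grid[1] - x_grid[0]
    # Finite-difference approximations of V' and V'' (one-sided at the boundary).
    dV = np.gradient(V, dx)
    d2V = np.gradient(dV, dx)

    T_new = np.empty_like(x_grid)
    for i, x in enumerate(x_grid):
        # L^a V(x) + f(x, a) = 0.5*sigma^2*V'' + mu*V' - alpha*V + f
        scores = [0.5 * sigma(x, a) ** 2 * d2V[i]
                  + mu(x, a) * dV[i]
                  - alpha(x, a) * V[i]
                  + f(x, a)
                  for a in actions]
        T_new[i] = actions[int(np.argmax(scores))]
    return T_new
```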

4. Diffusion Control Problems and Generator Structure

The paper establishes practical applicability of the abstract PIA framework to controlled Itô diffusions, where the generator plays a central role. Specifically,

$$\mathcal{L}^a h(x) = \frac{1}{2} \operatorname{Tr}\!\left(\sigma(x,a)\sigma(x,a)^\top Hh(x)\right) + \mu(x,a)^\top \nabla h(x) - \alpha(x,a)\, h(x).$$

Key sufficient conditions include compact action sets, uniform ellipticity, and Lipschitz regularity of the coefficients. These guarantee existence and regularity of solutions and enable verification of (As1)–(As8). The framework is particularly advantageous in such problems because it obviates the need for discretization and leverages PDE methods for bounds and verification theorems, even when strong uniqueness fails. Nevertheless, ensuring continuity and regularity of value functions—the technical heart of the approach—requires careful analysis and, in some cases, additional external results on the continuity of value functions for controlled diffusions.
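
For concreteness, the policy-evaluation step that feeds the improvement above can be approximated on a bounded interval by a central finite-difference discretisation of the linear equation $\mathcal{L}^{T(x)} V^{(T)}(x) + f(x, T(x)) = 0$. The sketch below is hypothetical: the Dirichlet boundary values and the helper name `evaluate_policy` are assumptions made for the example, not the paper's construction.

```python
import numpy as np

def evaluate_policy(x_grid, policy, mu, sigma, alpha, f, v_left=0.0, v_right=0.0):
    """Solve 0.5*sigma^2*V'' + mu*V' - alpha*V + f = 0 along a fixed policy.

    Central differences on the interior of an equally spaced x_grid, with
    assumed Dirichlet values v_left, v_right at the two endpoints.
    Returns an approximation of V^{(T)} on x_grid.
    """
    n = len(x_grid)
    dx = x_grid[1] - x_grid[0]
    A = np.zeros((n, n))
    b = np.zeros(n)

    A[0, 0], b[0] = 1.0, v_left          # boundary condition at the left endpoint
    A[-1, -1], b[-1] = 1.0, v_right      # boundary condition at the right endpoint

    for i in range(1, n - 1):
        x, a = x_grid[i], policy[i]
        s2 = sigma(x, a) ** 2
        m = mu(x, a)
        # 0.5*s2*(V[i-1] - 2*V[i] + V[i+1])/dx^2 + m*(V[i+1] - V[i-1])/(2*dx)
        #   - alpha*V[i] + f = 0
        A[i, i - 1] = 0.5 * s2 / dx**2 - m / (2 * dx)
        A[i, i] = -s2 / dx**2 - alpha(x, a)
        A[i, i + 1] = 0.5 * s2 / dx**2 + m / (2 * dx)
        b[i] = -f(x, a)

    return np.linalg.solve(A, b)
```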

5. Theoretical Considerations and Convergence Results

The convergence analysis interweaves martingale properties, generators, and PDE techniques. The improvement step is aligned with the structure of the HJB equation and verified under high-level assumptions about the Markov policy class and the regularity of payoffs. The sequence (V(Tn))(V^{(T_n)}) is shown to converge monotonically, and a subsequence of policies converges uniformly on compacts to the optimal policy—guaranteeing the algorithm’s efficacy for a general class of continuous-time control problems. This provides a strong theoretical link between functional analytic approaches to control and iterative construction of optimal policies.

6. Practical Implementation, Limitations, and Extension

Practical usage of continuous-time PIA necessitates:

  • Ensuring that improvement and evaluation steps return payoffs within the chosen function space (typically $C^2$ or continuously differentiable functions).
  • Careful verification of the relevant assumptions (e.g., compactness, regularity, Lipschitz continuity).
  • Addressing technical nontrivialities in regularity—additional results (cited as [5] and [6] in the original work) are required for the necessary continuity of value iterations before standard PDE tools can be applied.
  • Recognizing the foundational importance of the weak formulation for including controls that generate non-uniqueness or lack of strong solutions in SDEs.

For diffusion models, the structure of the generator supports the exploitation of PDE methods, yielding direct application and integration with established classical analysis.
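
Combining the two hypothetical helpers sketched above, a toy iteration could look as follows. The model coefficients are invented purely to exercise the loop; the printed values should be roughly non-decreasing across iterations, mirroring the monotonicity guarantee up to discretisation and boundary effects.

```python
import numpy as np

# Toy model (assumed, purely illustrative): dX_t = a dW_t, discount alpha = 1,
# running reward f(x, a) = -x^2 - 0.1*a^2, actions taken from a finite grid.
# Requires the evaluate_policy and improve_policy sketches above to be in scope.
x_grid = np.linspace(-2.0, 2.0, 201)
actions = np.linspace(0.5, 2.0, 16)
mu = lambda x, a: 0.0
sigma = lambda x, a: a
alpha = lambda x, a: 1.0
f = lambda x, a: -x**2 - 0.1 * a**2

policy = np.full_like(x_grid, actions[-1])     # deliberately suboptimal T_0
for n in range(10):
    V = evaluate_policy(x_grid, policy, mu, sigma, alpha, f)   # payoff V^{(T_n)}
    policy = improve_policy(x_grid, V, actions, mu, sigma, alpha, f)
    print(f"iteration {n}: min V = {V.min():.4f}, max V = {V.max():.4f}")
```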

7. Significance and Outlook

The general framework for PIA in continuous-time stochastic control unifies and generalizes earlier discrete-time policy improvement results to broad, weakly formulated settings. It facilitates inclusion of nonstandard, possibly non-smooth policies, as well as addressing cases where the strong solution concept is inapplicable. In diffusion-driven control applications, the PIA provides both a rigorous theoretical foundation and a practical iterative procedure for policy optimization, with convergence and regularity results that do not require time discretization. This positions the continuous-time PIA as a cornerstone for both theoretical investigation and implementation in high-dimensional, weakly-posed stochastic control problems, extending the reach of classical dynamic programming and opening new avenues for infinite-dimensional and weak-control settings (Jacka et al., 2015).
