
Malliavin Policy-Gradient Decomposition

Updated 15 June 2026
  • The paper introduces a variance-reduced estimator based on Malliavin calculus for efficient gradient estimation in rare-event scenarios.
  • It exploits weak derivative decomposition and common-random-number coupling to achieve unbiased gradient estimates with constant variance.
  • The framework applies to continuous-time reinforcement learning, enabling risk-sensitive and constraint-satisfying policy improvement.

Malliavin policy-gradient decomposition refers to a methodology for estimating the gradient of counterfactual performance measures in controlled diffusion processes, leveraging Malliavin calculus, weak derivatives, and a two-stage variance-reduced scheme. The paradigm was introduced for the setting where conditional loss functionals of diffusion processes are estimated with respect to stochastic model parameters, particularly in rare-event or constraint-conditioned regimes where naive estimators are highly inefficient and standard kernel smoothing approaches exhibit prohibitively slow convergence (Krishnamurthy et al., 30 Sep 2025).

1. Problem Setting and Motivation

Consider a probability space $(\Omega, \mathcal{F}, \mathbb{P})$ supporting a $d$-dimensional Brownian motion $W$ and its filtration $\{\mathcal{F}_t\}_{0 \leq t \leq T}$. A one-parameter family of controlled stochastic differential equations (SDEs) is defined by

$$dX_t^\theta = b_\theta(X_t^\theta, t)\,dt + \sigma(X_t^\theta, t)\,dW_t, \quad t \in [0, T], \quad X_0^\theta = x_0 \in \mathbb{R}^n,$$

for parameter $\theta \in \Theta \subset \mathbb{R}^p$. The objective is to study conditional expectations of the form

$$L(\theta) = \mathbb{E}\left[\ell(X^\theta)\,\big|\,g(X^\theta)=0\right],$$

where $\ell(X^\theta)$ denotes a reward or cost functional, and $g(X^\theta)$ is a path-functional imposing a conditioning constraint, typically a rare event or terminal requirement. The challenge arises because events $\{g(X^\theta)=0\}$ may have vanishing or exponentially small probability, making straightforward Monte Carlo estimation infeasible. The use of Malliavin calculus and weak derivatives enables an exact, kernel-free representation that achieves both computational tractability and statistically efficient estimation, even in rare-event regimes (Krishnamurthy et al., 30 Sep 2025).
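To make the difficulty concrete, the following minimal sketch uses a toy model that is an assumption for illustration and is not taken from the source: a driftless Brownian motion ($b_\theta = 0$, $\sigma = 1$), the constraint $g(X) = X_T - a$, and the loss $\ell(X) = X_{T/2}$. The function name `naive_conditional_mc` and the tolerance band `eps` are hypothetical. It approximates the conditioning event by $|g| < \varepsilon$ and reports how few simulated paths survive as $a$ moves into the tail.

```python
import numpy as np

def naive_conditional_mc(a, eps=0.05, n_paths=100_000, T=1.0, n_steps=20, seed=0):
    """Naive estimate of E[X_{T/2} | |X_T - a| < eps] for dX = dW, X_0 = 0.

    Toy setting (an assumption, not the source's model). Returns the
    estimate and the fraction of paths that fall inside the band.
    """
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    # Brownian increments and the resulting discrete paths (times dt, 2dt, ..., T).
    dw = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps))
    paths = np.cumsum(dw, axis=1)
    x_T = paths[:, -1]
    loss = paths[:, n_steps // 2 - 1]      # value at time T/2
    hit = np.abs(x_T - a) < eps            # smoothed version of {g = 0}
    n_hit = int(hit.sum())
    estimate = loss[hit].mean() if n_hit > 0 else float("nan")
    return estimate, n_hit / n_paths

if __name__ == "__main__":
    for a in (0.0, 3.0):
        est, frac = naive_conditional_mc(a)
        print(f"a={a}: estimate={est:.3f}, accepted fraction={frac:.5f}")
```

In this toy case the accepted fraction shrinks roughly like the Gaussian density at $a$ times the band width, so most of the simulation budget is discarded when the event is rare, and shrinking `eps` to reduce bias makes the waste worse.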

2. Malliavin Calculus Representations

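Conditional expectations under singular events are formally written as

$$L(\theta) = \frac{\mathbb{E}\left[\ell(X^\theta)\,\delta_0(g(X^\theta))\right]}{\mathbb{E}\left[\delta_0(g(X^\theta))\right]} = \frac{N(\theta)}{D(\theta)},$$

introducing the multidimensional Dirac delta $\delta_0$ and following the convention in stochastic analysis. The Malliavin derivative $D_t$, and the Skorohod (divergence) integral $\delta(\cdot)$, as in Nualart [20], are central objects. Malliavin integration-by-parts yields the duality formula:

$$\mathbb{E}\left[\int_0^T D_t F\, u_t\,dt\right] = \mathbb{E}\left[F\,\delta(u)\right],$$

for $F \in \mathbb{D}^{1,2}$ and admissible $u \in \mathrm{Dom}(\delta)$. Selecting a process $u^\theta$ normalized by

$$\int_0^T D_t g(X^\theta)\, u_t^\theta\,dt = 1$$

allows the numerator and denominator of $L(\theta)$ to be rewritten as
\begin{align*}
N(\theta) &= \mathbb{E}\left[1_{g>0}\left(\ell(X^\theta)\, \delta(u^\theta) - \int_0^T D_t \ell(X^\theta)\, u_t^\theta\,dt\right)\right], \\
D(\theta) &= \mathbb{E}\left[1_{g>0}\,\delta(u^\theta)\right].
\end{align*}
This representation expresses the conditional expectation exactly as a ratio of Skorohod-integral forms, where the variance is comparable to classical Monte Carlo, even in rare-event regimes. Canonical choices for $u^\theta$ include

$$u_t^\theta = \frac{D_t g(X^\theta)}{\int_0^T |D_s g(X^\theta)|^2\,ds},$$

assuming almost-everywhere non-degeneracy of $\int_0^T |D_s g(X^\theta)|^2\,ds$ (Krishnamurthy et al., 30 Sep 2025).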
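As a sanity check of the ratio-of-Skorohod-integrals idea, here is a minimal sketch on a toy case chosen for illustration (an assumption, not the source's experiment): $X = W$ is a standard Brownian motion, $g(X) = W_T - a$, and $\ell(X) = W_{T/2}$. Then $D_t g = 1$ on $[0, T]$, the normalized process is $u_t = 1/T$ with Skorohod integral $W_T / T$, and $\int_0^T D_t \ell \, u_t \, dt = 1/2$. The known answer is $\mathbb{E}[W_{T/2} \mid W_T = a] = a/2$. The function name is hypothetical.

```python
import numpy as np

def malliavin_conditional_midpoint(a, T=1.0, n_paths=400_000, seed=0):
    """Estimate E[W_{T/2} | W_T = a] without any kernel or tolerance band.

    Toy case (an assumption): g = W_T - a, ell = W_{T/2}, u_t = 1/T.
    """
    rng = np.random.default_rng(seed)
    # Only the values at T/2 and T are needed for this example.
    w_mid = rng.normal(0.0, np.sqrt(T / 2), size=n_paths)
    w_T = w_mid + rng.normal(0.0, np.sqrt(T / 2), size=n_paths)

    indicator = (w_T > a).astype(float)   # 1_{g > 0}
    skorohod = w_T / T                    # Skorohod integral of u_t = 1/T
    correction = 0.5                      # int_0^T D_t ell * u_t dt = (T/2) / T

    numerator = np.mean(indicator * (w_mid * skorohod - correction))
    denominator = np.mean(indicator * skorohod)
    return numerator / denominator

if __name__ == "__main__":
    a = 2.0
    print("estimate:", malliavin_conditional_midpoint(a), "exact:", a / 2)
```

Every path with $W_T > a$ contributes to both averages, whereas a band-based estimator only uses the paths that land within a small tolerance of $a$.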

3. Weak Derivative Gradient Decomposition

To compute $\nabla_\theta L(\theta)$, the quotient rule yields:

$$\nabla_\theta L(\theta) = \frac{D(\theta)\,\nabla_\theta N(\theta) - N(\theta)\,\nabla_\theta D(\theta)}{D(\theta)^2},$$

with $N(\theta)$, $D(\theta)$ as defined previously. Each gradient term $\nabla_\theta \mathbb{E}[F(X^\theta)]$ for a path-functional $F$ is handled without resorting to classical score-function (likelihood-ratio) estimators. Instead, discretizing the SDE, the weak derivative of the transition kernel $P_\theta(x, \cdot)$ (Gaussian) admits a Hahn–Jordan decomposition:

$$\nabla_\theta P_\theta(x, dy) = c_\theta(x)\left[P_\theta^{+}(x, dy) - P_\theta^{-}(x, dy)\right],$$

where $P_\theta^{+}$, $P_\theta^{-}$ are probability measures and $c_\theta(x)$ is a scalar weight. Thus, for any bounded $\varphi$,

$$\nabla_\theta \int \varphi(y)\,P_\theta(x, dy) = c_\theta(x)\left(\mathbb{E}\left[\varphi(Y^{+})\right] - \mathbb{E}\left[\varphi(Y^{-})\right]\right), \qquad Y^{\pm} \sim P_\theta^{\pm}(x, \cdot),$$

with the paths continued from $Y^{+}$ and $Y^{-}$ sharing Gaussian increments post-branching (“common-random-number coupling”). This procedure enables unbiased and statistically efficient estimation of gradient terms required in the policy-gradient formula (Krishnamurthy et al., 30 Sep 2025).
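For a Gaussian kernel the decomposition can be written out explicitly. The sketch below illustrates the standard weak derivative of $\mathcal{N}(\mu, \sigma^2)$ with respect to its mean: the weight is $1/(\sigma\sqrt{2\pi})$, and the positive and negative parts are $\mu \pm \sigma R$ with $R$ Rayleigh-distributed. Using the same draw of $R$ for both branches is one simple coupling, chosen here for illustration; the function name and the test function are hypothetical and not taken from the source.

```python
import math
import numpy as np

def gaussian_mean_weak_derivative(phi, mu, sigma, n_samples=100_000, seed=0):
    """Estimate d/dmu E[phi(Y)] for Y ~ N(mu, sigma^2) via a weak derivative.

    Uses d/dmu N(mu, sigma^2) = c * (P_plus - P_minus) with
    c = 1 / (sigma * sqrt(2*pi)), P_plus = law of mu + sigma*R,
    P_minus = law of mu - sigma*R, and R ~ Rayleigh(1).
    """
    rng = np.random.default_rng(seed)
    r = rng.rayleigh(scale=1.0, size=n_samples)   # shared by both branches
    c = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    y_plus = mu + sigma * r
    y_minus = mu - sigma * r
    return c * np.mean(phi(y_plus) - phi(y_minus))

if __name__ == "__main__":
    mu, sigma = 0.3, 0.8
    est = gaussian_mean_weak_derivative(np.sin, mu, sigma)
    # For phi = sin: E[sin(Y)] = sin(mu) * exp(-sigma^2 / 2).
    exact = math.cos(mu) * math.exp(-sigma**2 / 2)
    print("estimate:", est, "exact:", exact)
```

No score function $\nabla_\mu \log p_\mu(Y)$ appears: the derivative is a weighted difference of two ordinary expectations, which is what makes the branching construction possible.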

4. Variance Characteristics and Theoretical Guarantees

The Malliavin-weak-derivative estimator satisfies a constant-variance property. Under the conditions that $b_\theta$, $\sigma$ are sufficiently smooth in $x$ and $\theta$, and $\ell$, $g$ are functionals with square-integrable Malliavin derivatives, the estimator for $\nabla_\theta L(\theta)$ constructed via the Hahn–Jordan/branching scheme is unbiased and satisfies

$$\mathrm{Var}\left[\widehat{\nabla_\theta L}(\theta)\right] = O(1) \quad \text{uniformly in the horizon } T.$$

This stands in sharp contrast to the classical score-function (likelihood-ratio) estimator,

$$\widehat{\nabla_\theta L}^{\mathrm{SF}}(\theta) = \ell(X^\theta) \sum_{k=0}^{N-1} \nabla_\theta \log p_\theta\left(X_{k+1}^\theta \mid X_k^\theta\right),$$

where $p_\theta$ denotes the transition density of the SDE discretized into $N$ steps, whose variance grows linearly with the time horizon, i.e., $O(T)$. This variance reduction is particularly significant for long-horizon, rare-event-conditional, or high-dimensional stochastic systems, as substantiated in (Krishnamurthy et al., 30 Sep 2025) and prior analyses [Pflug ’96], [Heidergott–Vazquez–Wiener ’08], [Krishnamurthy–Snow ’24].
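The horizon dependence of the score-function estimator can be seen in a small experiment. The sketch below assumes a toy model chosen for illustration (not the source's): $dX_t = \theta\,dt + \sigma\,dW_t$ with $X_0 = 0$ and loss $\ell(X) = \sin(X_T)$. For this model the per-step score is $\Delta W_k / \sigma$, so the accumulated score is $W_T / \sigma$ and has variance $T / \sigma^2$. The function name is hypothetical.

```python
import numpy as np

def score_function_samples(theta, T, sigma=1.0, dt=0.1, n_paths=20_000, seed=0):
    """Per-path score-function gradient samples for dX = theta dt + sigma dW.

    Toy setting (an assumption): loss = sin(X_T), X_0 = 0. Each sample is
    loss * sum_k d/dtheta log p_theta(X_{k+1} | X_k) = loss * W_T / sigma.
    """
    rng = np.random.default_rng(seed)
    n_steps = int(round(T / dt))
    dw = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps))
    w_T = dw.sum(axis=1)
    x_T = theta * T + sigma * w_T     # Euler is exact for constant coefficients
    score = w_T / sigma               # sum of per-step Gaussian scores
    return np.sin(x_T) * score

if __name__ == "__main__":
    for T in (1.0, 4.0, 16.0):
        samples = score_function_samples(theta=0.5, T=T)
        print(f"T={T}: mean={samples.mean():.3f}, variance={samples.var():.3f}")
```

Although the loss is bounded by one, the sample variance in this toy case grows roughly in proportion to $T$, because the score accumulates noise from every time step.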

5. Practical Implementation Workflow

The method is operationalized by discretizing the SDE using the Euler scheme with stepsize $\Delta = T/N$. At an adaptively chosen time step $k$:

  • Simulate the path $X_{0:k}^\theta$ under nominal $\theta$.
  • Branch the path at $k$: generate $X_{k+1}^{+} \sim P_\theta^{+}(X_k^\theta, \cdot)$ and $X_{k+1}^{-} \sim P_\theta^{-}(X_k^\theta, \cdot)$, the positive and negative parts of the Hahn–Jordan decomposition of the transition kernel.
  • Propagate both $X^{+}$ and $X^{-}$ forward using identical Brownian increments ($\Delta W_j$, $j > k$, common random numbers).
  • Compute $\widehat{G}_F = c_\theta(X_k^\theta)\left[F(X^{+}) - F(X^{-})\right]$, with $c_\theta$ the Hahn–Jordan weight: for $F \in \{F_N, F_D\}$, where $F_N = 1_{g>0}\left(\ell\,\delta(u^\theta) - \int_0^T D_t \ell\, u_t^\theta\,dt\right)$ and $F_D = 1_{g>0}\,\delta(u^\theta)$.
  • Average $\widehat{G}_{F_N}$ and $\widehat{G}_{F_D}$ over independent replications; assemble $\nabla_\theta L(\theta)$ via the quotient rule.
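The steps above can be sketched in code under strong simplifying assumptions that are not from the source: a scalar Ornstein–Uhlenbeck drift $b_\theta(x) = -\theta x$, an unconditioned terminal loss $\ell(X) = X_T$ in place of the Malliavin-weighted functionals, and a branching step drawn uniformly at random and reweighted by the number of steps (rather than chosen adaptively). The helper names `drift`, `drift_dtheta`, and `branching_gradient` are hypothetical. The weight is signed, since it carries the factor $\partial_\theta b_\theta$.

```python
import numpy as np

def drift(x, theta):
    return -theta * x          # assumed toy drift b_theta(x)

def drift_dtheta(x, theta):
    return -x                  # its derivative with respect to theta

def branching_gradient(theta, x0=1.0, sigma=0.5, T=1.0, n_steps=50,
                       n_paths=20_000, seed=0, loss=lambda x: x):
    """One-branch weak-derivative estimate of d/dtheta E[loss(X_T)] (Euler scheme)."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    s = sigma * np.sqrt(dt)                       # std of one Euler transition
    k_branch = rng.integers(0, n_steps, size=n_paths)
    x_plus = np.full(n_paths, x0)
    x_minus = np.full(n_paths, x0)
    weight = np.zeros(n_paths)
    for k in range(n_steps):
        dw = rng.normal(0.0, np.sqrt(dt), size=n_paths)   # shared increments
        new_plus = x_plus + drift(x_plus, theta) * dt + sigma * dw
        new_minus = x_minus + drift(x_minus, theta) * dt + sigma * dw
        at = k_branch == k                        # paths that branch at this step
        mean = x_plus[at] + drift(x_plus[at], theta) * dt
        r = rng.rayleigh(1.0, size=int(at.sum()))
        # Signed weight: d(mean)/dtheta times the Gaussian weak-derivative constant.
        weight[at] = drift_dtheta(x_plus[at], theta) * dt / (s * np.sqrt(2 * np.pi))
        new_plus[at] = mean + s * r
        new_minus[at] = mean - s * r
        x_plus, x_minus = new_plus, new_minus
    # Factor n_steps compensates for sampling one branching step uniformly.
    return n_steps * np.mean(weight * (loss(x_plus) - loss(x_minus)))

if __name__ == "__main__":
    theta, n_steps, T = 0.8, 50, 1.0
    exact = -T * (1 - theta * T / n_steps) ** (n_steps - 1)   # Euler-scheme value, x0 = 1
    print("estimate:", branching_gradient(theta), "exact:", exact)
```

Before the branching step the two copies coincide, and after it they differ only through their starting points, so the difference in losses stays small. For the conditional objective, the same loop would be run once with each Malliavin-weighted functional in place of `loss`, and the results combined through the quotient rule.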

This construction underpins a unified policy-gradient approach that is robust to long horizons and rare-event conditioning, and does not require kernel smoothing or likelihood ratio tricks (Krishnamurthy et al., 30 Sep 2025).

6. Specialization to Reinforcement Learning and Broader Impact

In continuous-time reinforcement learning, parameter $\theta$ is interpreted as parametrizing the drift $b_\theta$ of the policy SDE. The expected return $J(\theta) = \mathbb{E}[\ell(X^\theta)]$ admits a Malliavin weak-derivative representation:

$$\nabla_\theta J(\theta) = \sum_{k=0}^{N-1} \mathbb{E}\left[c_\theta(X_k^\theta)\left(\ell(X^{+,k}) - \ell(X^{-,k})\right)\right],$$

where $X^{\pm,k}$ denote the paths branched at step $k$, with branching as described earlier. This “one-branch” gradient estimator maintains $O(1)$ variance, outperforming the standard REINFORCE estimator, whose variance scales as $O(T)$. The methodology extends directly to rare-event-conditional or constrained RL objectives. This suggests that the Malliavin policy-gradient decomposition constitutes an efficient framework for counterfactual, risk-sensitive, or constraint-satisfying policy improvement in stochastic control and reinforcement learning, particularly in domains demanding tractable rare-event analysis or robust performance under pathwise constraints (Krishnamurthy et al., 30 Sep 2025).

