Malliavin Policy-Gradient Decomposition

Updated 15 June 2026

The paper introduces a variance-reduced estimator based on Malliavin calculus for efficient gradient estimation in rare-event scenarios.
It exploits weak derivative decomposition and common-random-number coupling to achieve unbiased gradient estimates with constant variance.
The framework applies to continuous-time reinforcement learning, enabling risk-sensitive and constraint-satisfying policy improvement.

Malliavin policy-gradient decomposition refers to a methodology for estimating the gradient of counterfactual performance measures in controlled diffusion processes, leveraging Malliavin calculus, weak derivatives, and a two-stage variance-reduced scheme. The paradigm was introduced for the setting where conditional loss functionals of diffusion processes are estimated with respect to stochastic model parameters, particularly in rare-event or constraint-conditioned regimes where naive estimators are highly inefficient and standard kernel smoothing approaches exhibit prohibitively slow convergence (Krishnamurthy et al., 30 Sep 2025).

1. Problem Setting and Motivation

Consider a probability space $(\Omega, \mathcal{F}, \mathbb{P})$ supporting a $d$-dimensional Brownian motion $W$ and its filtration $\{\mathcal{F}_t\}_{0 \leq t \leq T}$. A parametric family of controlled stochastic differential equations (SDEs) is defined by

$dX_t^\theta = b_\theta(X_t^\theta, t)\,dt + \sigma(X_t^\theta, t)\,dW_t, \quad t \in [0, T], \quad X_0^\theta = x_0 \in \mathbb{R}^n,$

for parameter $\theta \in \Theta \subset \mathbb{R}^p$. The objective is to study conditional expectations of the form

$L(\theta) = \mathbb{E}\left[\ell(X^\theta)\,\big|\,g(X^\theta)=0\right],$

where $\ell(X^\theta)$ denotes a reward or cost functional, and $g(X^\theta)$ is a path functional imposing a conditioning constraint, typically a rare event or terminal requirement. The challenge arises because events $\{g(X^\theta)=0\}$ may have vanishing or exponentially small probability, making straightforward Monte Carlo estimation infeasible. Malliavin calculus and weak derivatives yield an exact, kernel-free representation that is computationally tractable and statistically efficient, even in rare-event regimes (Krishnamurthy et al., 30 Sep 2025).
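To make the difficulty concrete, the following is a minimal sketch of the naive approach: Euler–Maruyama simulation followed by conditioning on a small acceptance band around the constraint. The function names (`euler_maruyama`, `naive_conditional_mean`), the scalar state, the choice of the mid-horizon state as the functional, and the band width `eps` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def euler_maruyama(b, sigma, x0, theta, T, n_steps, n_paths, rng):
    """Simulate scalar Euler-Maruyama paths of dX = b(x,t,theta) dt + sigma(x,t) dW."""
    dt = T / n_steps
    x = np.full(n_paths, float(x0))
    path = np.empty((n_steps + 1, n_paths))
    path[0] = x
    for k in range(n_steps):
        dw = rng.normal(0.0, np.sqrt(dt), size=n_paths)  # Brownian increments
        x = x + b(x, k * dt, theta) * dt + sigma(x, k * dt) * dw
        path[k + 1] = x
    return path

def naive_conditional_mean(path, level, eps):
    """Average the mid-horizon state over paths whose endpoint is within eps of `level`."""
    hit = np.abs(path[-1] - level) < eps   # acceptance band around g(X) = X_T - level = 0
    n_hit = int(hit.sum())
    mid = path[path.shape[0] // 2]
    est = float(mid[hit].mean()) if n_hit > 0 else float("nan")
    return est, n_hit

rng = np.random.default_rng(0)
# Hypothetical model: constant drift theta and unit diffusion coefficient.
paths = euler_maruyama(lambda x, t, th: th + 0.0 * x, lambda x, t: 1.0 + 0.0 * x,
                       x0=0.0, theta=0.0, T=1.0, n_steps=50, n_paths=20_000, rng=rng)
est, n_hit = naive_conditional_mean(paths, level=4.0, eps=0.01)
print("accepted paths:", n_hit, "estimate:", est)
```

With the endpoint level four standard deviations from the mean, almost no paths fall inside the band, so the naive estimate is either undefined or dominated by noise. Widening the band trades this variance for bias.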

2. Malliavin Calculus Representations

Conditional expectations under singular events are formally written as

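$\mathbb{E}\left[\ell(X^\theta)\,\big|\,g(X^\theta)=0\right] = \frac{\mathbb{E}\left[\ell(X^\theta)\,\delta_0(g(X^\theta))\right]}{\mathbb{E}\left[\delta_0(g(X^\theta))\right]},$

introducing the multidimensional Dirac delta $\delta_0$ and following the convention in stochastic analysis. The Malliavin derivative $D_t$ and the Skorohod (divergence) integral $\delta(\cdot)$, as in Nualart [20], are central objects. Malliavin integration-by-parts yields the duality formula:

$\mathbb{E}\left[F\,\delta(u)\right] = \mathbb{E}\left[\int_0^T D_t F\, u_t\,dt\right]$

for $F \in \mathbb{D}^{1,2}$ and admissible $u \in \mathrm{Dom}(\delta)$. Selecting a process $u^\theta$ normalized by

$\int_0^T D_t g(X^\theta)\, u_t^\theta\,dt = 1$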

allows the numerator and denominator of $L(\theta)$ to be rewritten as

\begin{align*}
N(\theta) &= \mathbb{E}\Big[\mathbf{1}_{\{g(X^\theta)>0\}}\Big(\ell(X^\theta)\,\delta(u^\theta) - \int_0^T D_t \ell(X^\theta)\, u_t^\theta\,dt\Big)\Big], \\
D(\theta) &= \mathbb{E}\big[\mathbf{1}_{\{g(X^\theta)>0\}}\,\delta(u^\theta)\big].
\end{align*}

This representation expresses the conditional expectation exactly as the ratio $L(\theta) = N(\theta)/D(\theta)$ of Skorohod-integral forms, with variance comparable to classical Monte Carlo even in rare-event regimes. Canonical choices for $u^\theta$ include

$u_t^\theta = \frac{D_t g(X^\theta)}{\int_0^T |D_s g(X^\theta)|^2\,ds},$

assuming almost-everywhere non-degeneracy of $D_t g(X^\theta)$ (Krishnamurthy et al., 30 Sep 2025).
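As a sanity check of the ratio representation, the following sketch works through a toy case where everything is available in closed form. It assumes a scalar Brownian motion with constant drift, $X_t = x_0 + \theta t + \sigma W_t$, the constraint $g = X_T - a$, and the functional $\ell = X_{T/2}$; these choices and the function name `malliavin_bridge_midpoint` are illustrative and not taken from the paper. Here $D_t g = \sigma$ on $[0,T]$, so the weight $u_t = D_t g / \int_0^T |D_s g|^2\,ds = 1/(\sigma T)$ is deterministic and $\delta(u) = W_T/(\sigma T)$, while $\int_0^T D_t \ell\, u_t\,dt = 1/2$. The exact answer is the Brownian-bridge midpoint $x_0 + (a - x_0)/2$.

```python
import numpy as np

def malliavin_bridge_midpoint(level, T=1.0, sigma=1.0, theta=0.0, x0=0.0,
                              n_paths=400_000, seed=0):
    """Estimate E[X_{T/2} | X_T = level] as N/D without any kernel or acceptance band."""
    rng = np.random.default_rng(seed)
    # Exact sampling of the Brownian motion at T/2 and T.
    w_half = rng.normal(0.0, np.sqrt(T / 2), size=n_paths)
    w_T = w_half + rng.normal(0.0, np.sqrt(T / 2), size=n_paths)
    x_half = x0 + theta * T / 2 + sigma * w_half
    x_T = x0 + theta * T + sigma * w_T

    skorohod = w_T / (sigma * T)   # delta(u) for the deterministic weight u_t = 1/(sigma T)
    correction = 0.5               # int_0^T D_t ell u_t dt with D_t ell = sigma 1{t <= T/2}
    indicator = x_T > level        # 1{g > 0}

    numerator = np.mean(indicator * (x_half * skorohod - correction))
    denominator = np.mean(indicator * skorohod)
    return numerator / denominator

estimate = malliavin_bridge_midpoint(level=1.0)
print("Malliavin ratio estimate:", estimate, "exact:", 0.5)
```

Every simulated path with $X_T > a$ contributes to both averages, in contrast to band-based conditioning, where only paths landing near the level are used.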

3. Weak Derivative Gradient Decomposition

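To compute $\nabla_\theta L(\theta)$, the quotient rule yields:

$\nabla_\theta L(\theta) = \frac{D(\theta)\,\nabla_\theta N(\theta) - N(\theta)\,\nabla_\theta D(\theta)}{D(\theta)^2},$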

with $N(\theta)$, $D(\theta)$ as defined previously. Each gradient term $\nabla_\theta \mathbb{E}[\phi(X^\theta)]$ for a path-functional $\phi$ is handled without resorting to classical score-function (likelihood-ratio) estimators. Instead, discretizing the SDE, the weak derivative of the transition kernel $P_\theta(x, \cdot)$ (Gaussian) admits a Hahn–Jordan decomposition:

$\nabla_\theta P_\theta(x, \cdot) = c_\theta(x)\left[P_\theta^{+}(x, \cdot) - P_\theta^{-}(x, \cdot)\right],$

where $P_\theta^{+}$, $P_\theta^{-}$ are probability measures and $c_\theta$ is a scalar weight. Thus, for any bounded $f$,

$\nabla_\theta \int f(y)\,P_\theta(x, dy) = c_\theta(x)\left[\int f(y)\,P_\theta^{+}(x, dy) - \int f(y)\,P_\theta^{-}(x, dy)\right],$

with both branched paths $X^{+}$ and $X^{-}$ (started from draws of $P_\theta^{+}$ and $P_\theta^{-}$) sharing Gaussian increments post-branching (“common-random-number coupling”). This procedure enables unbiased and statistically efficient estimation of gradient terms required in the policy-gradient formula (Krishnamurthy et al., 30 Sep 2025).
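For a single Gaussian kernel $\mathcal{N}(\mu, \sigma^2)$ differentiated with respect to its mean, the Hahn–Jordan decomposition is explicit: the weight is $1/(\sigma\sqrt{2\pi})$ and the two probability measures are $\mu \pm \sigma R$ with $R$ Rayleigh-distributed. The sketch below compares the resulting estimator with the score-function estimator on the test function $f(z) = z^2$, whose exact derivative is $2\mu$. The function names and the test function are illustrative assumptions; using the same Rayleigh draw for both branches plays the role of the common-random-number coupling.

```python
import numpy as np

def weak_derivative_mean(f, mu, sigma, n, rng):
    """Per-sample weak-derivative estimates of d/dmu E[f(Z)], Z ~ N(mu, sigma^2)."""
    r = rng.rayleigh(scale=1.0, size=n)          # one draw shared by both branches
    c = 1.0 / (sigma * np.sqrt(2.0 * np.pi))     # Hahn-Jordan weight
    return c * (f(mu + sigma * r) - f(mu - sigma * r))

def score_function_mean(f, mu, sigma, n, rng):
    """Per-sample score-function (likelihood-ratio) estimates of the same derivative."""
    z = rng.normal(mu, sigma, size=n)
    return f(z) * (z - mu) / sigma**2            # f(Z) * d/dmu log density

rng = np.random.default_rng(0)
f = lambda z: z**2
wd = weak_derivative_mean(f, mu=1.0, sigma=1.0, n=200_000, rng=rng)
sf = score_function_mean(f, mu=1.0, sigma=1.0, n=200_000, rng=rng)
print("weak derivative: mean", wd.mean(), "variance", wd.var())
print("score function:  mean", sf.mean(), "variance", sf.var())
```

Both estimators target the same derivative, but the difference of two coupled evaluations of $f$ cancels most of the noise that the score-function estimator carries through the factor $f(Z)$.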

4. Variance Characteristics and Theoretical Guarantees

The Malliavin-weak-derivative estimator satisfies a constant-variance property. Under the conditions that $b_\theta$, $\sigma$ are sufficiently smooth in $x$ and $\theta$, and $\ell(X^\theta)$, $g(X^\theta)$ are Malliavin differentiable with square-integrable Malliavin derivatives, the estimator $\widehat{\nabla_\theta L}(\theta)$ for $\nabla_\theta L(\theta)$ constructed via the Hahn–Jordan/branching scheme is unbiased and satisfies

$\mathrm{Var}\big[\widehat{\nabla_\theta L}(\theta)\big] = O(1) \quad \text{uniformly in the time horizon } T.$

This stands in sharp contrast to the classical score-function (likelihood-ratio) estimator,

$\widehat{\nabla}_\theta^{\mathrm{SF}} = \ell(X^\theta) \sum_{k} \nabla_\theta \log p_\theta\big(X_{k+1}^\theta \,\big|\, X_k^\theta\big),$

with $p_\theta$ the transition density of the discretized SDE, whose variance grows linearly with the time horizon, i.e., $O(T)$. This variance reduction is particularly significant for long-horizon, rare-event-conditional, or high-dimensional stochastic systems, as substantiated in (Krishnamurthy et al., 30 Sep 2025) and prior analyses [Pflug ’96], [Heidergott–Vazquez–Wiener ’08], [Krishnamurthy–Snow ’24].
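The linear growth of the score-function variance can be observed empirically. The sketch below assumes a constant drift $b_\theta = \theta$, constant diffusion coefficient, and a bounded terminal reward $\cos(X_T)$; these choices and the function name `score_function_variance` are illustrative only. For the Euler transition $\mathcal{N}(x + \theta\,\Delta, \sigma^2 \Delta)$ the per-step score is $\Delta W_k / \sigma$, so the accumulated score is $W_T/\sigma$, whose second moment is $T/\sigma^2$.

```python
import numpy as np

def score_function_variance(T, steps_per_unit=20, theta=0.0, sigma=1.0,
                            n_paths=50_000, seed=0):
    """Sample variance of the score-function gradient estimate of E[cos(X_T)]."""
    rng = np.random.default_rng(seed)
    n_steps = int(T * steps_per_unit)
    dt = T / n_steps
    x = np.zeros(n_paths)
    score = np.zeros(n_paths)
    for _ in range(n_steps):
        dw = rng.normal(0.0, np.sqrt(dt), size=n_paths)
        x += theta * dt + sigma * dw
        score += dw / sigma   # d/dtheta log N(x_next; x + theta*dt, sigma^2*dt)
    return float(np.var(np.cos(x) * score))

for horizon in (1.0, 4.0, 16.0):
    print("T =", horizon, "variance =", score_function_variance(horizon))
```

Because the reward is bounded while the accumulated score keeps growing, the printed variances increase roughly in proportion to the horizon.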

5. Practical Implementation Workflow

The method is operationalized by discretizing the SDE using the Euler scheme with stepsize $\Delta$. At an adaptively chosen time step $k^*$:

Simulate the path $X_{0:k^*}^\theta$ under nominal $\theta$.
Branch the path at $k^*$: generate $X_{k^*+1}^{+}$ and $X_{k^*+1}^{-}$ from the two probability measures of the Hahn–Jordan decomposition.
Propagate both $X^{+}$ and $X^{-}$ forward using identical Brownian increments ($\Delta W_k$, common random numbers).
Compute the weighted difference $c_\theta\,[\phi(X^{+}) - \phi(X^{-})]$, with $c_\theta$ the scalar weight of the decomposition, for $\phi \in \{\phi_N, \phi_D\}$, where $N(\theta) = \mathbb{E}[\phi_N(X^\theta)]$ and $D(\theta) = \mathbb{E}[\phi_D(X^\theta)]$.
Average the resulting estimates over independent replications; assemble $\widehat{\nabla_\theta L}(\theta)$ via the quotient rule.
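The steps above can be sketched for a simple unconditional functional. The example assumes a scalar mean-reverting drift $b_\theta(x) = \theta - x$, constant $\sigma$, a terminal functional $\ell(X_N)$, and a branch time drawn uniformly at random and reweighted by the number of steps (a stand-in for the adaptive choice, which is not specified here). The names `one_branch_gradient` and `quotient_rule` are hypothetical. For $\ell(x) = x$ the Euler chain has the closed-form gradient $1 - (1 - \Delta)^N$, which the estimate can be checked against.

```python
import numpy as np

def one_branch_gradient(ell, theta, x0=0.0, sigma=1.0, T=1.0,
                        n_steps=50, n_paths=100_000, seed=0):
    """Per-path one-branch weak-derivative estimates of d/dtheta E[ell(X_N)]."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    sd = sigma * np.sqrt(dt)
    k_star = rng.integers(0, n_steps, size=n_paths)   # assumed: uniform branch time
    x = np.full(n_paths, float(x0))
    x_plus, x_minus = x.copy(), x.copy()
    for k in range(n_steps):
        xi = rng.normal(size=n_paths)                  # shared increment (CRN)
        r = rng.rayleigh(scale=1.0, size=n_paths)      # used only at the branch step
        mean = x + (theta - x) * dt                    # nominal one-step mean
        at, after = (k == k_star), (k > k_star)
        # Branch at k_star, then propagate both branches with the same noise.
        x_plus = np.where(at, mean + sd * r,
                          np.where(after, x_plus + (theta - x_plus) * dt + sd * xi, x_plus))
        x_minus = np.where(at, mean - sd * r,
                           np.where(after, x_minus + (theta - x_minus) * dt + sd * xi, x_minus))
        x = mean + sd * xi                             # nominal path, needed up to k_star
    # Gaussian weak-derivative weight times d(mean)/d(theta) = dt, times n_steps
    # to compensate for sampling a single branch time uniformly.
    weight = n_steps * dt / (sd * np.sqrt(2.0 * np.pi))
    return weight * (ell(x_plus) - ell(x_minus))

def quotient_rule(N, D, dN, dD):
    """Assemble the gradient of L = N / D from estimates of N, D and their gradients."""
    return (D * dN - N * dD) / D**2

grad_samples = one_branch_gradient(lambda x: x, theta=1.0)
print("estimate:", grad_samples.mean(), "exact:", 1 - (1 - 0.02) ** 50)
```

In the conditional setting, the same routine would be applied to the integrands of $N(\theta)$ and $D(\theta)$ and the two results combined with `quotient_rule`; computing those integrands requires the Skorohod integral and is omitted from this sketch.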

This construction underpins a unified policy-gradient approach that is robust to long horizons and rare-event conditioning, and does not require kernel smoothing or likelihood ratio tricks (Krishnamurthy et al., 30 Sep 2025).

6. Specialization to Reinforcement Learning and Broader Impact

In continuous-time reinforcement learning, parameter $\theta$ is interpreted as parametrizing the drift $b_\theta$ of the policy SDE. The expected return $J(\theta) = \mathbb{E}[\ell(X^\theta)]$ admits a Malliavin weak-derivative representation:

$\nabla_\theta J(\theta) = \mathbb{E}\left[c_\theta\,\big(\ell(X^{+}) - \ell(X^{-})\big)\right],$

where $X^{+}$ and $X^{-}$ are the branched paths and $c_\theta$ is the Hahn–Jordan weight, with branching as described earlier. This “one-branch” gradient estimator maintains $O(1)$ variance, outperforming the standard REINFORCE estimator, whose variance scales as $O(T)$. The methodology extends directly to rare-event-conditional or constrained RL objectives. This suggests that the Malliavin policy-gradient decomposition constitutes an efficient framework for counterfactual, risk-sensitive, or constraint-satisfying policy improvement in stochastic control and reinforcement learning, particularly in domains demanding tractable rare-event analysis or robust performance under pathwise constraints (Krishnamurthy et al., 30 Sep 2025).


References

Krishnamurthy et al. (2025). Malliavin Calculus with Weak Derivatives for Counterfactual Stochastic Optimization.
