Malliavin Policy-Gradient Decomposition
- The paper introduces a variance-reduced estimator based on Malliavin calculus for efficient gradient estimation in rare-event scenarios.
- It exploits weak derivative decomposition and common-random-number coupling to achieve unbiased gradient estimates with constant variance.
- The framework applies to continuous-time reinforcement learning, enabling risk-sensitive and constraint-satisfying policy improvement.
Malliavin policy-gradient decomposition refers to a methodology for estimating the gradient of counterfactual performance measures in controlled diffusion processes, leveraging Malliavin calculus, weak derivatives, and a two-stage variance-reduced scheme. The paradigm was introduced for the setting where conditional loss functionals of diffusion processes are estimated with respect to stochastic model parameters, particularly in rare-event or constraint-conditioned regimes where naive estimators are highly inefficient and standard kernel smoothing approaches exhibit prohibitively slow convergence (Krishnamurthy et al., 30 Sep 2025).
1. Problem Setting and Motivation
Consider a probability space $(\Omega, \mathcal{F}, \mathbb{P})$ supporting a $d$-dimensional Brownian motion $W$ and its filtration $(\mathcal{F}_t)_{t \in [0,T]}$. A one-parameter family of controlled stochastic differential equations (SDEs) is defined by
$$dX_t^\theta = b(X_t^\theta, \theta)\,dt + \sigma(X_t^\theta, \theta)\,dW_t, \qquad t \in [0,T],$$
for parameter $\theta$. The objective is to study conditional expectations of the form
$$\mathbb{E}\big[\ell(X^\theta) \,\big|\, g(X^\theta) = 0\big],$$
where $\ell$ denotes a reward or cost functional, and $g$ is a path-functional imposing a conditioning constraint, typically a rare event or terminal requirement. The challenge arises because the conditioning event $\{g(X^\theta) = 0\}$ may have zero or exponentially small probability, making straightforward Monte Carlo estimation infeasible. Malliavin calculus and weak derivatives give an exact, kernel-free representation that is computationally tractable and statistically efficient, even in rare-event regimes (Krishnamurthy et al., 30 Sep 2025).
2. Malliavin Calculus Representations
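Conditional expectations under singular events are formally written as
$$\mathbb{E}\big[\ell(X^\theta) \,\big|\, g(X^\theta) = 0\big] = \frac{\mathbb{E}\big[\ell(X^\theta)\,\delta_0(g(X^\theta))\big]}{\mathbb{E}\big[\delta_0(g(X^\theta))\big]},$$
introducing the multidimensional Dirac delta $\delta_0$ and following the convention in stochastic analysis. The Malliavin derivative $D_t$ and the Skorohod (divergence) integral $\delta(\cdot)$, as in Nualart [20], are central objects. Malliavin integration by parts yields the duality formula
$$\mathbb{E}\Big[\int_0^T D_t F\, u_t\,dt\Big] = \mathbb{E}\big[F\,\delta(u)\big]$$
for $F \in \mathbb{D}^{1,2}$ and admissible $u \in \mathrm{Dom}(\delta)$. Selecting a process $u^\theta$ normalized by
$$\int_0^T D_t g(X^\theta)\, u_t^\theta\,dt = 1$$
allows the numerator and denominator of this ratio to be rewritten as
\begin{align*}
N(\theta) &= \mathbb{E}\Big[\mathbf{1}_{\{g(X^\theta) > 0\}}\Big(\ell(X^\theta)\,\delta(u^\theta) - \int_0^T D_t \ell(X^\theta)\, u_t^\theta\,dt\Big)\Big], \\
D(\theta) &= \mathbb{E}\big[\mathbf{1}_{\{g(X^\theta) > 0\}}\,\delta(u^\theta)\big].
\end{align*}
This representation expresses the conditional expectation exactly as the ratio $N(\theta)/D(\theta)$ of Skorohod-integral forms, with variance comparable to classical Monte Carlo even in rare-event regimes. Canonical choices for $u^\theta$ include
$$u_t^\theta = \frac{D_t g(X^\theta)}{\int_0^T |D_s g(X^\theta)|^2\,ds},$$
assuming almost-everywhere non-degeneracy of $D g(X^\theta)$ (Krishnamurthy et al., 30 Sep 2025).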
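As a hedged sanity check of the ratio representation, consider a toy case that is an assumption of this sketch and not an example from the source: $X_t = \sigma W_t$ with no drift, $g(X) = X_T - a$, and $\ell(X) = X_{T/2}$. Then $D_t g = \sigma$, the canonical weight is the constant $u_t = 1/(\sigma T)$ with $\delta(u) = W_T/(\sigma T)$, and $\int_0^T D_t \ell\, u_t\,dt = 1/2$. The ratio $N/D$ should recover the Brownian-bridge value $a/2$, and $D$ should equal the Gaussian density of $X_T$ at $a$. The function name and defaults are illustrative.

```python
import math
import random

def malliavin_conditional_mean(a, sigma=1.0, T=1.0, n_paths=200_000, seed=0):
    """Monte Carlo estimates of N and D for the toy model X_t = sigma * W_t,
    g(X) = X_T - a, ell(X) = X_{T/2} (setting assumed for this sketch)."""
    rng = random.Random(seed)
    half_sd = math.sqrt(T / 2.0)
    num = 0.0
    den = 0.0
    for _ in range(n_paths):
        w_half = half_sd * rng.gauss(0.0, 1.0)        # W_{T/2}
        w_T = w_half + half_sd * rng.gauss(0.0, 1.0)  # W_T
        if sigma * w_T > a:                           # indicator 1_{g > 0}
            skorohod = w_T / (sigma * T)              # delta(u) for u = 1/(sigma*T)
            ell = sigma * w_half                      # ell(X) = X_{T/2}
            num += ell * skorohod - 0.5               # subtract int D_t ell u_t dt
            den += skorohod
    return num / n_paths, den / n_paths

# Usage: condition on the zero-probability event X_T = 2.
N_hat, D_hat = malliavin_conditional_mean(a=2.0)
cond_mean = N_hat / D_hat   # should be close to a / 2
```

No sample ever hits the event $\{X_T = a\}$ exactly; the estimator only uses the half-space indicator $\mathbf{1}_{\{X_T > a\}}$ and the Skorohod weight, which is what makes the representation kernel-free.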
3. Weak Derivative Gradient Decomposition
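To compute the gradient of the conditional expectation with respect to $\theta$, the quotient rule yields:
$$\nabla_\theta \frac{N(\theta)}{D(\theta)} = \frac{D(\theta)\,\nabla_\theta N(\theta) - N(\theta)\,\nabla_\theta D(\theta)}{D(\theta)^2},$$
with $N(\theta)$, $D(\theta)$ as defined previously. Each gradient term $\nabla_\theta \mathbb{E}[F(X^\theta)]$ for a path-functional $F$ is handled without resorting to classical score-function (likelihood-ratio) estimators. Instead, discretizing the SDE, the weak derivative of the transition kernel $P_\theta(x, \cdot)$ (Gaussian) admits a Hahn–Jordan decomposition:
$$\nabla_\theta P_\theta(x, \cdot) = c_\theta(x)\,\big(P_\theta^{+}(x, \cdot) - P_\theta^{-}(x, \cdot)\big),$$
where $P_\theta^{+}$, $P_\theta^{-}$ are probability measures and $c_\theta(x)$ is a scalar weight. Thus, for any bounded $f$,
$$\nabla_\theta \int f(y)\,P_\theta(x, dy) = c_\theta(x)\,\big(\mathbb{E}[f(X^{+})] - \mathbb{E}[f(X^{-})]\big),$$
with both $X^{+}$ and $X^{-}$ sharing Gaussian increments post-branching (“common-random-number coupling”). This procedure enables unbiased and statistically efficient estimation of gradient terms required in the policy-gradient formula (Krishnamurthy et al., 30 Sep 2025).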
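For a Gaussian kernel the decomposition is explicit. Differentiating the $N(\mu, \sigma^2)$ density with respect to its mean gives $c = 1/(\sigma\sqrt{2\pi})$, with $P^{+}$ and $P^{-}$ the laws of $\mu + \sigma R$ and $\mu - \sigma R$ for a standard Rayleigh variable $R$. This is a standard fact about the Gaussian family rather than a formula quoted from the source. The sketch below estimates $\frac{d}{d\mu}\mathbb{E}[f(X)]$ this way, reusing one Rayleigh draw for both branches; the function name is illustrative.

```python
import math
import random

def gaussian_weak_derivative(f, mu, sigma, n_samples=100_000, seed=0):
    """Estimate d/dmu E[f(X)] for X ~ N(mu, sigma^2) via the
    Hahn-Jordan split c * (P+ - P-) of the mean derivative."""
    rng = random.Random(seed)
    c = 1.0 / (sigma * math.sqrt(2.0 * math.pi))   # scalar weight
    total = 0.0
    for _ in range(n_samples):
        # Standard Rayleigh draw by inverse transform; 1 - U lies in (0, 1].
        r = math.sqrt(-2.0 * math.log(1.0 - rng.random()))
        x_plus = mu + sigma * r    # sample from P+
        x_minus = mu - sigma * r   # sample from P-, coupled through the same r
        total += c * (f(x_plus) - f(x_minus))
    return total / n_samples

# Usage: for f = sin the exact derivative is cos(mu) * exp(-sigma^2 / 2).
estimate = gaussian_weak_derivative(math.sin, mu=0.3, sigma=0.8)
exact = math.cos(0.3) * math.exp(-0.8 ** 2 / 2.0)
```

No derivative of $f$ and no log-density score appears; only two evaluations of $f$ per sample are needed.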
4. Variance Characteristics and Theoretical Guarantees
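The Malliavin-weak-derivative estimator satisfies a constant-variance property. Under the conditions that the drift $b$ and diffusion coefficient $\sigma$ are sufficiently smooth in the state $x$ and in $\theta$, and that $\ell$, $g$ are Malliavin differentiable with square-integrable Malliavin derivatives, the estimator $\widehat{\nabla}_\theta$ for $\nabla_\theta \mathbb{E}[F(X^\theta)]$ constructed via the Hahn–Jordan/branching scheme is unbiased and satisfies
$$\mathrm{Var}\big[\widehat{\nabla}_\theta\big] = O(1) \quad \text{uniformly in the horizon } T.$$
This stands in sharp contrast to the classical score-function (likelihood-ratio) estimator,
$$\widehat{\nabla}_\theta^{\mathrm{SF}} = F(X^\theta) \sum_{k} \nabla_\theta \log p_\theta(X_{k+1} \mid X_k),$$
whose variance grows linearly with the time horizon, i.e., $O(T)$. This variance reduction is particularly significant for long-horizon, rare-event-conditional, or high-dimensional stochastic systems, as substantiated in (Krishnamurthy et al., 30 Sep 2025) and prior analyses [Pflug ’96], [Heidergott–Vazquez–Wiener ’08], [Krishnamurthy–Snow ’24].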
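The growth of the score-function variance with the horizon can be seen in a small experiment. The sketch below covers only the likelihood-ratio side of the comparison and uses a toy model assumed for illustration: $dX = \theta\,dt + \sigma\,dW$ with $X_0 = 0$ and a bounded terminal cost $f(X_T)$. For this model the per-step score of the Euler kernel with respect to $\theta$ is $\Delta W_k/\sigma$. The function name and defaults are hypothetical.

```python
import math
import random
import statistics

def score_function_samples(f, theta, sigma, T, dt=0.25, n_paths=5000, seed=0):
    """Per-path score-function (likelihood-ratio) gradient samples for
    dX = theta dt + sigma dW, X_0 = 0, cost f(X_T) (toy model, assumed)."""
    rng = random.Random(seed)
    n_steps = round(T / dt)
    sd = math.sqrt(dt)
    samples = []
    for _path in range(n_paths):
        x = 0.0
        score = 0.0
        for _step in range(n_steps):
            dw = sd * rng.gauss(0.0, 1.0)
            x += theta * dt + sigma * dw
            score += dw / sigma   # d/dtheta of the log-density of one Euler step
        samples.append(f(x) * score)
    return samples

# Usage: compare the sample variance at a short and a long horizon.
var_short = statistics.variance(score_function_samples(math.cos, 0.0, 1.0, T=1.0))
var_long = statistics.variance(score_function_samples(math.cos, 0.0, 1.0, T=16.0))
```

With $f = \cos$, $\theta = 0$, and $\sigma = 1$, the variance of this estimator has the closed form $T/2 + (T - 4T^2)e^{-2T}/2$, so it approaches $T/2$ for large $T$, in line with the linear growth described above.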
5. Practical Implementation Workflow
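The method is operationalized by discretizing the SDE using the Euler scheme with stepsize $\Delta$. At an adaptively chosen time step $k$:
- Simulate the path $X_{0:k}$ under the nominal parameter $\theta$.
- Branch the path at $k$: generate $X_{k+1}^{+} \sim P_\theta^{+}(X_k, \cdot)$ and $X_{k+1}^{-} \sim P_\theta^{-}(X_k, \cdot)$.
- Propagate both $X^{+}$ and $X^{-}$ forward using identical Brownian increments ($\Delta W$, common random numbers).
- Compute $c_\theta(X_k)\,\big[F(X^{+}) - F(X^{-})\big]$ for $F \in \{F_N, F_D\}$, where $F_N = \mathbf{1}_{\{g > 0\}}\big(\ell\,\delta(u) - \int_0^T D_t \ell\, u_t\,dt\big)$ and $F_D = \mathbf{1}_{\{g > 0\}}\,\delta(u)$.
- Average over independent replications to estimate $\nabla_\theta N$ and $\nabla_\theta D$; assemble the gradient of the conditional expectation via the quotient rule.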
This construction underpins a unified policy-gradient approach that is robust to long horizons and rare-event conditioning, and does not require kernel smoothing or likelihood ratio tricks (Krishnamurthy et al., 30 Sep 2025).
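The branching steps above can be sketched for the simplest gradient term, $\nabla_\theta \mathbb{E}[f(X_T)]$, leaving out the Skorohod weights and the quotient-rule assembly. Several choices here are assumptions of the sketch rather than details from the source: a scalar SDE with constant $\sigma$, a branch step drawn uniformly at random and reweighted by the number of steps, and the explicit Gaussian split in which an Euler step with mean $m$ and standard deviation $s$ has mean-derivative $c\,(P^{+} - P^{-})$ with $c = 1/(s\sqrt{2\pi})$ and $P^{\pm}$ the laws of $m \pm sR$ for a standard Rayleigh $R$. All function and argument names are hypothetical.

```python
import math
import random

def weak_derivative_gradient(f, drift, d_drift, theta, sigma, x0, T,
                             n_steps, n_paths, seed=0):
    """One-branch weak-derivative estimate of d/dtheta E[f(X_T)] for the Euler
    scheme of dX = drift(X, theta) dt + sigma dW (scalar, constant sigma)."""
    rng = random.Random(seed)
    dt = T / n_steps
    s = sigma * math.sqrt(dt)                  # std of one Euler step
    c = 1.0 / (s * math.sqrt(2.0 * math.pi))   # Gaussian Hahn-Jordan weight
    total = 0.0
    for _path in range(n_paths):
        k = rng.randrange(n_steps)             # branch step (uniform: an assumption)
        x = x0
        for _step in range(k):                 # nominal path up to step k
            x += drift(x, theta) * dt + s * rng.gauss(0.0, 1.0)
        # Branch: replace the Gaussian draw at step k by +/- a Rayleigh draw.
        r = math.sqrt(-2.0 * math.log(1.0 - rng.random()))
        mean = x + drift(x, theta) * dt
        weight = d_drift(x, theta) * dt * c    # chain rule through the step mean
        x_plus, x_minus = mean + s * r, mean - s * r
        for _step in range(k + 1, n_steps):    # common random numbers after branching
            z = rng.gauss(0.0, 1.0)
            x_plus += drift(x_plus, theta) * dt + s * z
            x_minus += drift(x_minus, theta) * dt + s * z
        # Factor n_steps compensates for sampling a single branch step.
        total += n_steps * weight * (f(x_plus) - f(x_minus))
    return total / n_paths

# Usage: constant drift b(x, theta) = theta, where the exact gradient of
# E[sin(X_T)] is T * cos(x0 + theta*T) * exp(-sigma^2 * T / 2).
grad = weak_derivative_gradient(math.sin, lambda x, th: th, lambda x, th: 1.0,
                                theta=0.5, sigma=1.0, x0=0.0, T=1.0,
                                n_steps=20, n_paths=20_000)
```

Because both branches reuse the same increments after step `k`, the difference `f(x_plus) - f(x_minus)` stays small, which is the role of the common-random-number coupling.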
6. Specialization to Reinforcement Learning and Broader Impact
In continuous-time reinforcement learning, the parameter $\theta$ is interpreted as parametrizing the drift $b(\cdot, \theta)$ of the policy SDE. The expected return $J(\theta) = \mathbb{E}[\ell(X^\theta)]$ admits a Malliavin weak-derivative representation:
$$\nabla_\theta J(\theta) = \mathbb{E}\big[c_\theta\,\big(\ell(X^{+}) - \ell(X^{-})\big)\big],$$
with branching as described earlier. This “one-branch” gradient estimator maintains $O(1)$ variance, outperforming the standard REINFORCE estimator, whose variance scales as $O(T)$. The methodology extends directly to rare-event-conditional or constrained RL objectives. This suggests that the Malliavin policy-gradient decomposition constitutes an efficient framework for counterfactual, risk-sensitive, or constraint-satisfying policy improvement in stochastic control and reinforcement learning, particularly in domains demanding tractable rare-event analysis or robust performance under pathwise constraints (Krishnamurthy et al., 30 Sep 2025).