
Deep BSDE Method: Differential Learning

Updated 3 February 2026
  • The Deep BSDE method is a computational framework that uses deep neural networks and Malliavin calculus to approximate the solutions and derivatives of high-dimensional BSDEs and the corresponding PDEs.
  • The approach employs a differential-learning architecture with separate networks for Y, Z, and Γ, enabling joint optimization of the value, gradient, and Hessian estimates.
  • Numerical results show errors 1–2 orders of magnitude lower, significantly improved convergence, and reduced runtimes compared with classical methods in high-dimensional applications.

A deep backward stochastic differential equation (BSDE) method is a class of algorithms that leverage deep neural networks to approximate the solutions of high-dimensional nonlinear BSDEs, together with their derivatives; such BSDEs are tightly coupled to parabolic partial differential equations (PDEs) via nonlinear Feynman–Kac formulae. The Deep BSDE paradigm is central to modern computational mathematics, mathematical finance, and stochastic control, owing to its tractability in hundreds of dimensions and its compatibility with Monte Carlo simulation. The "Deep BSDE method" encompasses a broad family of strategies, including the differential-learning techniques described below, which systematically utilize both the values and the pathwise derivatives of the BSDE process.

1. BSDE Formulation and Malliavin-Lifted System

Consider the decoupled forward–backward SDE system over $[0, T]$:

$$
\begin{aligned}
X_t &= x_0 + \int_0^t a(s, X_s)\,ds + \int_0^t b(s, X_s)\,dW_s, \\
Y_t &= g(X_T) + \int_t^T f(s, X_s, Y_s, Z_s)\,ds - \int_t^T Z_s\,dW_s,
\end{aligned}
$$

where $Y_t = u(t, X_t)$ is the solution field and $Z_t = \nabla_x u(t, X_t)\, b(t, X_t)$ its spatial gradient contracted with the volatility coefficient. To ensure that both $Z_t$ and the Hessian $\Gamma_t = \nabla_x\big(\nabla_x u(t, x)\, b(t, x)\big)\big|_{x = X_t}$ are amenable to network-based learning, the methodology leverages Malliavin calculus: Malliavin differentiation of the forward and backward equations yields the associated Malliavin-lifted system

$$
\begin{aligned}
D_s X_t &= \mathbb{1}_{s \le t}\left[b(s, X_s) + \int_s^t \nabla_x a(r, X_r)\, D_s X_r\,dr + \int_s^t \nabla_x b(r, X_r)\, D_s X_r\,dW_r\right], \\
D_s Y_t &= \mathbb{1}_{s \le t}\left[\nabla_x g(X_T)\, D_s X_T + \int_t^T \big(\nabla_x f \cdot D_s X_r + \nabla_y f \cdot D_s Y_r + \nabla_z f \cdot D_s Z_r\big)\,dr - \int_t^T D_s Z_r\,dW_r\right],
\end{aligned}
$$

with the pathwise identities $D_t Y_t = Z_t$ and $D_s Z_t = \Gamma_t\, D_s X_t$ (Kapllani et al., 2024).

This formulation makes the infinitesimal dynamics of all relevant sensitivities (value, first and second spatial derivatives) explicit and available for direct optimization during network training.
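For reference, and as standard background rather than a result specific to the cited works, the nonlinear Feynman–Kac link behind the identities $Y_t = u(t, X_t)$ and $Z_t = \nabla_x u(t, X_t)\, b(t, X_t)$ can be written explicitly: $u$ solves the semilinear terminal-value problem

$$
\begin{aligned}
&\partial_t u(t,x) + \nabla_x u(t,x)\, a(t,x) + \tfrac{1}{2}\,\mathrm{Tr}\!\left(b(t,x)\, b(t,x)^\top \nabla_x^2 u(t,x)\right) \\
&\qquad\qquad + f\big(t, x, u(t,x), \nabla_x u(t,x)\, b(t,x)\big) = 0, \qquad u(T, x) = g(x),
\end{aligned}
$$

so that $Z_t$ and $\Gamma_t$ encode the first- and second-order spatial sensitivities of this PDE solution along the forward paths.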

2. Discretization and Regression Equations

The continuous system is discretized via Euler–Maruyama on a uniform grid $0 = t_0 < \dots < t_N = T$, $\Delta t = t_{n+1} - t_n$:

$$
\begin{aligned}
X_{n+1}^\Delta &= X_n^\Delta + a(t_n, X_n^\Delta)\,\Delta t + b(t_n, X_n^\Delta)\,\Delta W_n, \\
Y_n^\Delta &= Y_{n+1}^\Delta + f(t_n, X_n^\Delta, Y_n^\Delta, Z_n^\Delta)\,\Delta t - Z_n^\Delta\,\Delta W_n, \\
D_n X_m^\Delta &= \ldots, \\
D_n Y_n^\Delta &= D_n Y_{n+1}^\Delta + f_D(t_n, \cdots)\,\Delta t - D_n Z_n^\Delta\,\Delta W_n, \\
D_n Y_n^\Delta &= Z_n^\Delta, \quad D_n Y_{n+1}^\Delta = \nabla_x u(t_{n+1}, X_{n+1}^\Delta)\, D_n X_{n+1}^\Delta, \\
D_n Z_n^\Delta &= \Gamma_n^\Delta\, D_n X_n^\Delta,
\end{aligned}
$$

with $f_D$ denoting the appropriate Malliavin-weighted differential of the driver (Kapllani et al., 2024).

The necessity to jointly simulate the processes $(Y, Z, \Gamma)$ and their discrete Malliavin increments underlines the challenge: standard Deep BSDE architectures parameterizing only $Z$ are insufficient for high-fidelity derivative estimation at $t > 0$.
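To make the bookkeeping concrete, the following is a minimal simulation sketch, assuming a PyTorch implementation and illustrative geometric-Brownian-motion coefficients $a(t,x) = \mu x$, $b(t,x) = \sigma\,\mathrm{diag}(x)$ (the function name simulate_forward and these model choices are not from the cited paper). It generates Euler–Maruyama paths of $X^\Delta$ together with the one-step discrete Malliavin derivatives $D_n X_n^\Delta$ and $D_n X_{n+1}^\Delta$ that the $Z$-loss below requires.

```python
import torch

# Minimal sketch (not the authors' code): Euler-Maruyama simulation of X and of the
# one-step discrete Malliavin derivatives D_n X_n and D_n X_{n+1} used by the Z-loss.
# Illustrative model: a(t, x) = mu * x and b(t, x) = sigma * diag(x), so that
# grad_x a = mu * I and b is diagonal in x.

def simulate_forward(x0, mu, sigma, T, N, batch):
    d = x0.shape[-1]
    dt = T / N
    X = torch.empty(batch, N + 1, d)
    X[:, 0] = x0
    dW = torch.randn(batch, N, d) * dt ** 0.5        # Brownian increments
    DX_n = torch.empty(batch, N, d, d)               # D_n X_n     = b(t_n, X_n)
    DX_next = torch.empty(batch, N, d, d)            # D_n X_{n+1} (one Euler step later)
    for n in range(N):
        Xn = X[:, n]
        # forward Euler-Maruyama step
        X[:, n + 1] = Xn + mu * Xn * dt + sigma * Xn * dW[:, n]
        # D_n X_n = b(t_n, X_n) = sigma * diag(X_n)
        DX_n[:, n] = sigma * torch.diag_embed(Xn)
        # for diagonal b: D_n X_{n+1}^{ij} = D_n X_n^{ij} * (1 + mu*dt + sigma*dW_n^i)
        factor = 1.0 + mu * dt + sigma * dW[:, n]    # shape (batch, d)
        DX_next[:, n] = factor.unsqueeze(-1) * DX_n[:, n]
    return X, dW, DX_n, DX_next
```

Because $b$ is diagonal in this example, the Malliavin derivative propagates componentwise and the per-step cost stays at $\mathcal{O}(d^2)$ per path; for general coefficients the same loop carries full Jacobian products.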

3. Differential Deep Learning Architecture

Three independent feed-forward neural networks are introduced:

$$
\begin{aligned}
\phi^y(t, x; \theta^y) &\approx Y_n^\Delta, \\
\phi^z(t, x; \theta^z) &\approx Z_n^\Delta, \\
\phi^\gamma(t, x; \theta^\gamma) &\approx \Gamma_n^\Delta,
\end{aligned}
$$

with input $(t, x) \in \mathbb{R} \times \mathbb{R}^d$ and outputs of shape $\mathbb{R}$, $\mathbb{R}^{1\times d}$, and $\mathbb{R}^{d\times d}$, respectively, selected for joint approximation of function value, gradient, and Hessian. A typical choice is $L = 4$ layers with $\eta = 100 + d$ neurons per layer and $\tanh$ activation, for $\mathcal{O}(L\eta^2)$ parameters.
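A possible realization of this architecture, assuming a PyTorch implementation (class and helper names here are illustrative, not the authors'), is sketched below; each network is a plain $\tanh$ multilayer perceptron acting on the concatenated input $(t, x)$.

```python
import torch
import torch.nn as nn

# Sketch of the three networks phi^y, phi^z, phi^gamma described above, under the
# assumption of a PyTorch implementation. Hidden width eta = 100 + d, L = 4 layers, tanh.

def mlp(in_dim, out_dim, width, n_layers=4):
    layers, dim = [], in_dim
    for _ in range(n_layers):
        layers += [nn.Linear(dim, width), nn.Tanh()]
        dim = width
    layers.append(nn.Linear(dim, out_dim))
    return nn.Sequential(*layers)

class DifferentialNets(nn.Module):
    def __init__(self, d):
        super().__init__()
        width = 100 + d
        self.d = d
        self.phi_y = mlp(d + 1, 1, width)         # value    Y_n     in R
        self.phi_z = mlp(d + 1, d, width)         # gradient Z_n     in R^{1 x d}
        self.phi_g = mlp(d + 1, d * d, width)     # Hessian  Gamma_n in R^{d x d}

    def forward(self, t, x):
        # t: (batch, 1), x: (batch, d)
        inp = torch.cat([t, x], dim=-1)
        y = self.phi_y(inp)                            # (batch, 1)
        z = self.phi_z(inp).unsqueeze(1)               # (batch, 1, d)
        g = self.phi_g(inp).view(-1, self.d, self.d)   # (batch, d, d)
        return y, z, g
```

Calling `y, z, g = DifferentialNets(d)(t_batch, x_batch)` returns the value, gradient, and Hessian estimates for a batch of space-time points; the three parameter sets $\theta^y$, $\theta^z$, $\theta^\gamma$ are trained jointly through the joint loss of Section 4.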

This architectural split contrasts with classical Deep BSDE, which typically parameterizes only the control $Z$ and the initial value $Y_0$ through individual networks. Here, the presence and role of $\Gamma$ are explicit and critical, especially for HJB and other fully nonlinear drivers.

4. Differential-Learning Joint Loss Function

A global joint loss function is constructed by combining two local per-step residuals:

  • The $Y$-loss, which enforces the discrete backward SDE increment:

$$
L^{y,\Delta}(\theta) = \mathbb{E}\left[\sum_{n=0}^{N-1} \left|Y_{n+1}^{\Delta,\theta} - Y_n^{\Delta,\theta} + f(t_n, \ldots)\,\Delta t - Z_n^{\Delta,\theta}\,\Delta W_n\right|^2 + \left|Y_N^{\Delta,\theta} - g(X_N^\Delta)\right|^2 \right],
$$

  • The $Z$-loss, which enforces the Malliavin-increment equation (including pathwise derivatives):

$$
L^{z,\Delta}(\theta) = \mathbb{E}\left[\sum_{n=0}^{N-1} \left|D_n Y_{n+1}^{\Delta,\theta} - Z_n^{\Delta,\theta} + f_D(t_n, \ldots)\,\Delta t - \Gamma_n^{\Delta,\theta}\, D_n X_n^\Delta\,\Delta W_n\right|^2 + \left|Z_N^{\Delta,\theta} - \nabla_x g(X_N^\Delta)\, b(T, X_N^\Delta)\right|^2\right].
$$

A convex combination,

$$
L^\Delta(\theta) = \omega_1 L^{y,\Delta}(\theta) + \omega_2 L^{z,\Delta}(\theta), \qquad \omega_1 = \frac{1}{d+1}, \quad \omega_2 = \frac{d}{d+1},
$$

is minimized. This structure enforces both the pathwise evolution and the Malliavin-derivative constraints at every time step, dramatically increasing the accuracy of $Z$ and particularly $\Gamma$ as compared to previous local or terminal-only loss designs (Kapllani et al., 2024).
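A sketch of how this joint loss can be assembled from per-step tensors is given below, again assuming PyTorch; the tensor names, shapes, and the helper joint_loss are illustrative assumptions, and the quantity $D_n Y_{n+1}^{\Delta,\theta}$ is assumed to have been formed upstream from the network outputs (e.g., via $Z_{n+1}^{\Delta,\theta} b^{-1}$ or automatic differentiation of $\phi^y$).

```python
import torch

# Sketch of L^Delta = w1 * L^y + w2 * L^z with the assumed tensor shapes:
#   Y        (batch, N+1, 1)     phi^y(t_n, X_n)
#   Z        (batch, N+1, 1, d)  phi^z(t_n, X_n)
#   G        (batch, N,   d, d)  phi^gamma(t_n, X_n)
#   dW       (batch, N,   d)     Brownian increments
#   f_vals   (batch, N, 1), fD_vals (batch, N, 1, d)   driver f and its Malliavin differential
#   DnY_next (batch, N, 1, d)    D_n Y_{n+1}, formed from the network outputs
#   DnX      (batch, N, d, d)    D_n X_n
#   gT       (batch, 1), dgT_b (batch, 1, d)           g(X_N) and grad g(X_N) b(T, X_N)

def joint_loss(Y, Z, G, dW, f_vals, fD_vals, DnY_next, DnX, gT, dgT_b, dt, d):
    # Y-loss: discrete backward-SDE residual at every step plus terminal condition.
    z_dw = (Z[:, :-1] * dW.unsqueeze(2)).sum(-1)              # Z_n dW_n, shape (batch, N, 1)
    res_y = Y[:, 1:] - Y[:, :-1] + f_vals * dt - z_dw
    loss_y = (res_y ** 2).sum(1).mean() + ((Y[:, -1] - gT) ** 2).mean()

    # Z-loss: discrete Malliavin-BSDE residual plus terminal condition on Z.
    gamma_dnx = torch.matmul(G, DnX)                          # Gamma_n D_n X_n
    gamma_dnx_dw = (gamma_dnx * dW.unsqueeze(-1)).sum(2)      # contracted with dW_n -> (batch, N, d)
    res_z = DnY_next - Z[:, :-1] + fD_vals * dt - gamma_dnx_dw.unsqueeze(2)
    loss_z = (res_z ** 2).sum((1, 2, 3)).mean() + ((Z[:, -1] - dgT_b) ** 2).sum((1, 2)).mean()

    # convex weights from the text: w1 = 1/(d+1), w2 = d/(d+1)
    w1, w2 = 1.0 / (d + 1), d / (d + 1.0)
    return w1 * loss_y + w2 * loss_z
```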

5. Training Procedure and Implementation

Training is performed by global stochastic optimization (typically Adam) on the joint loss. The input is normalized per time slice; parameter initialization uses Xavier or He schemes compatible with modern frameworks. Training proceeds over $K = 60\,000$ steps with batch size $128$.
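Combining the previous sketches, a training loop consistent with this description might look as follows (PyTorch assumed; the toy problem with driver $f \equiv 0$, terminal $g(x) = \sum_i x_i$, the learning rate, and the grid size $N$ are illustrative assumptions, whereas Adam, $K = 60\,000$ steps, and batch size $128$ come from the text).

```python
import torch

# End-to-end training sketch reusing DifferentialNets, simulate_forward and joint_loss
# from the sketches above. Toy problem: GBM forward dynamics, f = 0, g(x) = sum_i x_i,
# and D_n Y_{n+1} formed as Z_{n+1} b^{-1}(t_{n+1}, X_{n+1}) D_n X_{n+1} (b invertible here).

d, N, T, mu, sigma = 5, 20, 1.0, 0.05, 0.2
dt = T / N
x0 = torch.ones(d)
t_grid = torch.linspace(0.0, T, N + 1)
model = DifferentialNets(d)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)   # learning rate is an assumption

for _ in range(60_000):                               # K = 60 000 optimization steps
    batch = 128
    X, dW, DnX, DnX_next = simulate_forward(x0, mu, sigma, T, N, batch)
    t_in = t_grid.view(1, N + 1, 1).expand(batch, N + 1, 1)
    # evaluate all three networks on every time slice (t_n, X_n) at once
    y, z, g = model(t_in.reshape(-1, 1), X.reshape(-1, d))
    Y = y.view(batch, N + 1, 1)
    Z = z.view(batch, N + 1, 1, d)
    G = g.view(batch, N + 1, d, d)
    # D_n Y_{n+1} = Z_{n+1} b^{-1}(t_{n+1}, X_{n+1}) D_n X_{n+1}, with b = sigma * diag(x)
    Zb_inv = Z[:, 1:] / (sigma * X[:, 1:]).unsqueeze(2)
    DnY_next = torch.matmul(Zb_inv, DnX_next)
    f_vals = torch.zeros(batch, N, 1)                 # driver f = 0 in this toy problem
    fD_vals = torch.zeros(batch, N, 1, d)             # hence its Malliavin differential is 0
    gT = X[:, -1].sum(-1, keepdim=True)               # g(X_N) = sum_i X_N^i
    dgT_b = (sigma * X[:, -1]).unsqueeze(1)           # grad g(X_N) b(T, X_N) = sigma * X_N
    loss = joint_loss(Y, Z, G[:, :-1], dW, f_vals, fD_vals, DnY_next, DnX, gT, dgT_b, dt, d)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Since a fresh Monte Carlo batch is simulated at every iteration, the optimization is a plain stochastic-gradient scheme over the joint loss, with all three networks updated simultaneously.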

Gradient computation is handled by automatic differentiation through the (deterministic) Euler–Maruyama simulation and the forward network evaluation, requiring no manual intervention even as the Malliavin-system dynamics are included in the loss.

The only non-trivial implementation aspect is the correct tracking of the discrete Malliavin derivatives $D_n X^\Delta$ for the $Z$-loss; otherwise the code mirrors standard Deep BSDE approaches. All computations remain Monte Carlo simulation, feed-forward evaluation, and backpropagation, with complexity linear in the spatial dimension $d$, the grid size $N$, and the network size.

6. Numerical Performance and Comparison with Classical Deep BSDE

On benchmark problems up to $d = 50$ (toy nonlinear drivers, high-dimensional Black–Scholes basket options, and HJB control equations), the forward differential deep-learning scheme achieves:

  • $Y$ and $Z$ errors $1$–$2$ orders of magnitude lower than deep BSDE methods lacking derivative supervision,
  • $\Gamma$ error reduced from $\mathcal{O}(1)$ to $10^{-2}$–$10^{-3}$,
  • Wall-clock runtime $2$–$5\times$ smaller than classical methods when $\Gamma$ is approximated via autograd,
  • Empirical convergence rates $\beta \approx 1.0$–$1.7$ for the $Y$ and $Z$ errors as a function of refinement, compared to nearly zero or negative rates for classical methods (Kapllani et al., 2024).

This improvement is a consequence of imposing the joint dynamics of $(Y, Z, \Gamma)$ throughout the time grid, in contrast to classical Deep BSDE, which parameterizes $Z$ only, matches primarily at the terminal condition, and typically yields poor intermediate-time derivative estimates.

7. Broader Context and Extensions

The forward differential Deep BSDE method represents a significant refinement of the original Deep BSDE paradigm (Han et al., 7 May 2025). It adopts the Malliavin-lifting principle to ensure that every relevant pathwise sensitivity is explicitly parameterized and learned. This not only yields more accurate solution fields and derivatives, but is also robust and scalable in high dimension.

Unlike other extensions (control-variate schemes, locally additive losses, XNet/Cauchy architectures, or pathwise/rough-signature enrichments), the forward differential approach ensures direct, simultaneous training of $Y$, $Z$, and $\Gamma$. This is especially pertinent for financial applications (where second-order Greeks are critical) and for HJB-type semilinear and fully nonlinear PDEs.

The approach is compatible with further architectural improvements (e.g., Cauchy bases, transformer-style attention, hybrid PINN losses), but its principal distinguishing feature is the explicit and joint regression of the Malliavin-lifted triple. The method can be regarded as a natural foundation for future algorithmic development in high-dimensional PDE/BSDE solvers.

References:

  • Kapllani et al., 2024
  • Han et al., 7 May 2025
