
Deep BSDE Filter

Updated 17 November 2025
  • Deep BSDE Filter is an approximate Bayesian nonlinear filtering method that reformulates the filtering problem using backward stochastic differential equations and deep neural networks.
  • It employs a nonlinear Feynman–Kac representation to derive rigorous error bounds and achieves O(Δt^(1/2)) convergence through controlled time discretization.
  • Practical implementations on test cases like the Ornstein–Uhlenbeck process and bistable drift models demonstrate mesh-free performance and rapid online inference.

The Deep BSDE Filter is an approximate Bayesian nonlinear filtering method based on backward stochastic differential equations (BSDEs). It reframes the evolution of conditional filtering densities in terms of a nonlinear Feynman–Kac representation and leverages deep learning—specifically, neural networks trained with deep BSDE approaches—for approximating these densities. The core advantages include the use of offline training for rapid online inference, preservation of a rigorous error bound, and the potential to remain mesh-free in higher dimensions.

1. Nonlinear Filtering and the Zakai Equation

Nonlinear filtering concerns estimating the conditional probability density $p(t, x \mid O_{1:k})$ of a hidden signal $S_t$ that evolves according to a stochastic differential equation (SDE)
$$dS_t = \mu(S_t)\,dt + \sigma(S_t)\,dB_t, \qquad S_0 \sim \pi_0(x),$$
where $B_t$ is a Brownian motion, $\mu$ and $\sigma$ are the drift and diffusion coefficients, and $\pi_0$ is the initial law. Observations are received at discrete times $t_k$,
$$O_k = h(S_{t_k}) + V_k,$$
with $V_k$ independent Gaussian noise. Between observation updates the unnormalized conditional density evolves by the Fokker–Planck (prediction) equation
$$\partial_t p(t, x) = A^* p(t, x), \qquad t \in (t_k, t_{k+1}],$$
with an instantaneous Bayes update at each observation arrival,
$$p(t_{k+1}^+, x) = p(t_{k+1}^-, x)\, L(O_{k+1} \mid x),$$
where $A^*$ is the adjoint of the generator $A\varphi = \tfrac{1}{2}\operatorname{Tr}[a(x) D^2\varphi] + \mu(x)\cdot\nabla\varphi$ with $a = \sigma\sigma^\top$, and $L(o \mid x)$ is the observation likelihood. In continuous-observation settings, the Zakai equation can be written using Itô calculus as

$$dp_t(x) = A^* p_t(x)\,dt + p_t(x)\,h(x)\cdot dY_t.$$
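
As a concrete illustration of this state-space setup (not part of the filter itself), the sketch below simulates the hidden signal by Euler–Maruyama and generates discrete noisy observations; the model functions, step counts, and noise level are illustrative placeholders.

```python
import numpy as np

def simulate_signal_and_observations(mu, sigma, h, x0, T=1.0, K=10, n_sub=50,
                                     obs_std=1.0, rng=None):
    """Euler-Maruyama simulation of dS_t = mu(S_t) dt + sigma(S_t) dB_t with
    discrete observations O_k = h(S_{t_k}) + V_k at times t_k = k*T/K."""
    rng = np.random.default_rng() if rng is None else rng
    dt = T / (K * n_sub)                      # fine step between observation times
    s = x0
    observations = []
    for k in range(1, K + 1):
        for _ in range(n_sub):                # propagate the signal to t_k
            dB = rng.normal(scale=np.sqrt(dt))
            s = s + mu(s) * dt + sigma(s) * dB
        observations.append(h(s) + obs_std * rng.normal())   # O_k = h(S_{t_k}) + V_k
    return np.array(observations)

# Example: Ornstein-Uhlenbeck signal observed through the identity function.
obs = simulate_signal_and_observations(mu=lambda x: -x, sigma=lambda x: 1.0,
                                       h=lambda x: x, x0=0.0)
```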

2. Nonlinear Feynman–Kac and BSDE Representation

To exploit probabilistic representations, the filtering problem is recast via the (nonlinear) Feynman–Kac formula over each prediction interval $[t_k, t_{k+1}]$. An auxiliary forward process, independent of the observation path, is introduced:
$$dX_s = \mu(X_s)\,ds + \sigma(X_s)\,dW_s, \qquad X_{t_k} = x,$$
where $W_s$ is an independent Brownian motion. The terminal condition for the backward pass is defined recursively by
$$g_k(x, O_{1:k}) = p_{k-1}(t_k, x, O_{1:k-1})\, L(O_k \mid x), \qquad g_0(x) = \pi_0(x).$$
The unnormalized density at $t_{k+1}$ is obtained as

$$p_k(t_{k+1}, x, O_{1:k}) = \mathbb{E}\left[ g_k(X_{t_{k+1}}) + \int_{t_k}^{t_{k+1}} f(X_s, Y_s, Z_s)\, ds \right],$$

where $(X_s, Y_s, Z_s)$ solve the uncoupled forward–backward SDE system for $s \in [t_k, t_{k+1}]$:
$$\begin{gathered} dX_s = \mu(X_s)\,ds + \sigma(X_s)\,dW_s, \qquad X_{t_k} = x, \\ Y_s = g_k(X_{t_{k+1}}) + \int_s^{t_{k+1}} f(X_r, Y_r, Z_r)\, dr - \int_s^{t_{k+1}} Z_r\, dW_r. \end{gathered}$$
To produce the unnormalized density at any $t \in [t_k, t_{k+1}]$, $Y$ is evaluated at the corresponding (reversed) time.
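
The deep BSDE step described next consumes batches of sample paths of this auxiliary process. A minimal path sampler, assuming PyTorch tensors and illustrative callables `mu` and `sigma`, might look as follows; it also records the Brownian increments, which the backward rollout reuses.

```python
import torch

def sample_forward_paths(mu, sigma, x, dt, n_steps):
    """Euler-Maruyama paths of dX_s = mu(X_s) ds + sigma(X_s) dW_s on [t_k, t_{k+1}].

    x: tensor of shape (batch, d) of starting points X_{t_k}.
    Returns the discrete path (list of states) and the Brownian increments.
    """
    paths, increments = [x], []
    for _ in range(n_steps):
        dW = torch.randn_like(x) * dt ** 0.5   # independent Brownian increment
        x = x + mu(x) * dt + sigma(x) * dW
        paths.append(x)
        increments.append(dW)
    return paths, increments
```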

3. Deep BSDE Approximation and Neural Architecture

The backward SDE is discretized in time using a controlled process:
$$\mathcal{Y}_{n+1} = \mathcal{Y}_n - f(\mathcal{X}_n, \mathcal{Y}_n, \mathcal{Z}_n)\,\Delta t + \mathcal{Z}_n\,\Delta W_n, \qquad \mathcal{Y}_0 = w(\mathcal{X}_{t_k}, O_{1:k}),$$
where $\mathcal{X}_n$ is an Euler–Maruyama path and the $\Delta W_n$ are independent Brownian increments. The solution is parameterized by neural networks:

  • $w^\theta(x, O_{1:k})$ approximates $Y_{t_k}$
  • $v_n^\theta(x, O_{1:k})$ approximates $Z$ at time step $t_{k,n}$

Training occurs via minimization of the empirical terminal loss over $M$ simulated trajectories:
$$\ell(\theta) = \frac{1}{M} \sum_{m=1}^M \left| \mathcal{Y}_N^{(m)} - \overline g_k\big(\mathcal{X}_N^{(m)}, O_{1:k}^{(m)}\big)\right|^2,$$
where $\overline g_k$ may be normalized or unnormalized at the terminal point.
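
A minimal sketch of one loss evaluation for a single observation step, assuming PyTorch, the path sampler sketched in Section 2, and user-supplied networks `w_net` and `v_nets` (one per time step); the driver `f`, the terminal function `g_bar`, and all names are illustrative placeholders rather than the authors' reference implementation.

```python
import torch

def bsde_rollout_loss(w_net, v_nets, f, g_bar, x0, obs, mu, sigma, dt):
    """One empirical terminal-loss evaluation for a single observation step k.

    x0:  (batch, d) samples of X_{t_k};  obs: (batch, obs_dim) encoded O_{1:k}.
    w_net(x, obs) approximates Y_{t_k}; v_nets[n](x, obs) approximates Z at step n.
    """
    paths, increments = sample_forward_paths(mu, sigma, x0, dt, len(v_nets))
    y = w_net(x0, obs)                                  # Y_0 = w(X_{t_k}, O_{1:k})
    for n, dW in enumerate(increments):
        z = v_nets[n](paths[n], obs)                    # Z_n at time t_{k,n}
        y = y - f(paths[n], y, z) * dt + (z * dW).sum(dim=1, keepdim=True)
    target = g_bar(paths[-1], obs)                      # terminal condition g_k
    return ((y - target) ** 2).mean()                   # empirical terminal loss

# One optimizer step (Adam, as in the reported setup):
# loss = bsde_rollout_loss(w_net, v_nets, f, g_bar, x0, obs, mu, sigma, dt)
# loss.backward(); optimizer.step(); optimizer.zero_grad()
```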

The network design includes:

  • $w$-network: fully connected, ReLU activations, 3 hidden layers of size 128, exponential output activation, input dimension $d + (d' \times (k-1))$
  • $v$-networks: one per time step, 3 hidden layers of size 32, linear output, same input size (a sketch of both architectures follows this list)
  • Training: Adam optimizer, learning rate $10^{-4}$, batch size 512, up to 100 epochs with early stopping (patience 5 epochs), and parameter sharing across observation steps via zero-padding of unused observations
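
A possible PyTorch realization of the $w$- and $v$-networks matching the description above; the exponential output on the $w$-network keeps the density estimate positive. Class and argument names are assumptions for illustration.

```python
import torch
import torch.nn as nn

def mlp(in_dim, hidden, out_dim, n_hidden=3):
    """Fully connected network with n_hidden ReLU layers and a linear output."""
    layers, dim = [], in_dim
    for _ in range(n_hidden):
        layers += [nn.Linear(dim, hidden), nn.ReLU()]
        dim = hidden
    layers.append(nn.Linear(dim, out_dim))
    return nn.Sequential(*layers)

class WNet(nn.Module):
    """w^theta(x, O_{1:k}): 3 hidden layers of 128, exponential output (positive density)."""
    def __init__(self, d, obs_dim):
        super().__init__()
        self.body = mlp(d + obs_dim, 128, 1)
    def forward(self, x, obs):
        return torch.exp(self.body(torch.cat([x, obs], dim=1)))

class VNet(nn.Module):
    """v_n^theta(x, O_{1:k}): 3 hidden layers of 32, linear output of dimension d."""
    def __init__(self, d, obs_dim):
        super().__init__()
        self.body = mlp(d + obs_dim, 32, d)
    def forward(self, x, obs):
        return self.body(torch.cat([x, obs], dim=1))
```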

Normalization of densities is performed using quadrature (for $d = 1$, with $J = 10^3$ evaluation points on $[-5, 5]$). The dominant training cost scales as $K \times N \times M_\text{batches}$.
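
For $d = 1$ the normalization step reduces to a one-dimensional quadrature; below is a minimal sketch using the trapezoidal rule on $[-5, 5]$ with $J$ evaluation points (the function name and interface are illustrative).

```python
import numpy as np

def normalize_density_1d(p_unnorm, a=-5.0, b=5.0, J=1000):
    """Normalize an unnormalized 1-d density given as a callable on [a, b]
    using trapezoidal quadrature with J evaluation points."""
    grid = np.linspace(a, b, J)
    values = p_unnorm(grid)
    dx = (b - a) / (J - 1)
    mass = dx * (values.sum() - 0.5 * (values[0] + values[-1]))   # trapezoidal rule
    return grid, values / mass
```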

4. Error Analysis and Theoretical Bounds

Under smoothness and uniform ellipticity conditions, a mixed a priori–a posteriori error bound is established for the maximum deviation of the learned density:
$$\max_{1 \leq k \leq K} \|p_k(t_k) - \widehat p_k\|_{\infty} \leq C\left(\tau^{1/2} + \sum_{j=0}^{K-1} \sup_{x,O}\,\mathbb{E}\Big[\big|\overline g_j(\mathcal{X}_N^{j,x}, O_{1:j}) - \mathcal{Y}_N^{j,x}\big|^2\Big]^{1/2}\right),$$
where $\tau = T/(KN)$. The error consists of an explicit time-discretization term $O(\tau^{1/2})$ and a residual a posteriori (learning) term reflecting empirical convergence.

5. Representative Numerical Experiments

Two test cases provide numerical validation of the approach:

  • Ornstein–Uhlenbeck process (linear): $\mu(x) = -x$, $\sigma(x) = 1$, $h(x) = x$, $R = 1$; the reference solution is given analytically by the Kalman–Bucy filter (a sketch of such a reference computation follows the list). With $K = 10$, $T = 1$, and $N = 2^j$ for $j = 0, \ldots, 6$, the observed final-time error $e_K$ and accumulated residual $E$ exhibit $N^{-1/2}$ convergence, with uniform accuracy over observation steps.
  • Bistable drift: $\mu(x) = (2/5)(5x - x^3)$, $\sigma = 1$, $h(x) = x$; the reference solution is a $10^5$-particle bootstrap filter with kernel density estimation. Using the same $T$, $K$, and $N$, $e_K$ and $E$ again show $N^{-1/2}$ decay up to $N = 16$, beyond which a plateau signals that the learning residual becomes dominant.
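
For the linear test case, the exact posterior is Gaussian and can be reproduced with a discrete-time Kalman recursion built on the exact Ornstein–Uhlenbeck transition; the sketch below is one way to compute such a reference and is not taken from the paper.

```python
import numpy as np

def ou_kalman_reference(observations, dt_obs, m0=0.0, p0=1.0, obs_var=1.0):
    """Exact Gaussian filter for dS_t = -S_t dt + dB_t with O_k = S_{t_k} + V_k.

    Uses the exact OU transition: the mean decays by exp(-dt) and the variance
    relaxes toward the stationary value 1/2. Returns posterior means/variances.
    """
    a = np.exp(-dt_obs)                       # transition coefficient
    q = 0.5 * (1.0 - np.exp(-2.0 * dt_obs))   # exact transition variance
    m, p, means, variances = m0, p0, [], []
    for o in observations:
        m, p = a * m, a * a * p + q           # predict to the observation time
        gain = p / (p + obs_var)              # Kalman gain for h(x) = x, R = obs_var
        m, p = m + gain * (o - m), (1.0 - gain) * p
        means.append(m); variances.append(p)
    return np.array(means), np.array(variances)
```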

6. Practical Implementation Guidance

Adaptation to higher dimensions and different model classes follows several best practices:

  • Employ richer neural network architectures (e.g., time embeddings, U-Nets) for spatially high-dimensional $x$.
  • Combine with multilevel Monte Carlo (MLMC) strategies: begin with coarse time steps ($N$), then fine-tune on finer grids without reinitializing weights.
  • Randomize the sampled time steps during training to ensure robust performance for all $t \in [t_k, t_{k+1}]$.
  • Use a sufficiently large number of Monte Carlo samples $M$ to reduce the a posteriori residual below the discretization error.
  • Normalize densities in higher dimensions with robust quadrature or importance sampling (see the sketch after this list).
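
As an illustration of the last point, the normalizing constant in higher dimensions can be estimated by self-normalized importance sampling with a simple proposal; the sketch below assumes a zero-mean Gaussian proposal, which is only one possible choice.

```python
import numpy as np

def normalizing_constant_is(p_unnorm, d, n_samples=100_000, proposal_std=3.0, rng=None):
    """Estimate Z = integral of an unnormalized density p_unnorm over R^d by
    importance sampling with a N(0, proposal_std^2 I) proposal."""
    rng = np.random.default_rng() if rng is None else rng
    x = rng.normal(scale=proposal_std, size=(n_samples, d))
    # Log-density of the Gaussian proposal evaluated at the samples.
    log_q = -0.5 * np.sum((x / proposal_std) ** 2, axis=1) \
            - 0.5 * d * np.log(2.0 * np.pi * proposal_std ** 2)
    weights = p_unnorm(x) / np.exp(log_q)     # importance weights p / q
    return weights.mean()                     # Monte Carlo estimate of Z
```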

The passage from the Zakai formulation to the BSDE representation, and then to neural approximation with sequential observation updates, yields a Deep BSDE Filter that achieves a mesh-free $O(\Delta t^{1/2})$ convergence rate in time and empirical consistency across multiple nonlinear filtering scenarios.
