
Parallel-in-Time Kalman Filters

Updated 20 November 2025
  • Parallel-in-Time Kalman Filters are advanced algorithms that recast sequential Bayesian estimation into parallel prefix-sum operations for state-space models.
  • They reduce the classical O(N) time complexity to O(log N) by leveraging associative scan primitives, significantly speeding up filtering and smoothing tasks.
  • These methods have been applied to Gaussian Process regression, probabilistic ODE solvers, and integrated measurement systems, achieving substantial computational speedups on modern hardware.

Parallel-in-time Kalman filters are temporally parallel algorithms for recursive Bayesian estimation in dynamical systems modeled by state-space equations. Their essential advance is the reduction of algorithmic span (critical-path length) from the classical linear-in-time $O(N)$ update sequence to logarithmic $O(\log N)$ depth using associative scan primitives. This temporal parallelism, distinct from spatial or data parallelism, enables effective exploitation of modern hardware such as GPUs and multi-core CPUs for filtering and smoothing tasks in linear and certain nonlinear models, including temporal Gaussian processes, probabilistic ODE solvers, and integrated measurement systems.

1. State-Space Models and Classical Kalman Filtering

Dynamical systems for filtering are generally modeled as discrete- or continuous-time state-space equations:

$$
\begin{aligned}
\mathbf{x}_{k} &= F_{k-1}\,\mathbf{x}_{k-1} + \mathbf{u}_{k-1} + \mathbf{w}_{k-1}, \quad \mathbf{w}_{k-1} \sim \mathcal{N}(0,Q_{k-1}) \\
\mathbf{y}_k &= H_k\,\mathbf{x}_k + \mathbf{d}_k + \mathbf{v}_k, \quad \mathbf{v}_k \sim \mathcal{N}(0,R_k)
\end{aligned}
$$

The classical Kalman filter consists of sequential predict and update recurrences, with the Rauch–Tung–Striebel (RTS) smoother extending the analysis backward in time. For $N$ time steps, classical filtering and smoothing intrinsically require $O(N)$ time due to temporal dependency. This sequential bottleneck limits throughput, especially for large-scale or high-frequency problems (Corenflos et al., 2021).
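The sequential recurrences can be sketched in a few lines of NumPy; the control and offset terms $\mathbf{u}$, $\mathbf{d}$ are dropped for brevity, and the model values in the usage example are illustrative, not taken from any cited paper.

```python
import numpy as np

def kalman_filter(F, Q, H, R, ys, m0, P0):
    """Sequential Kalman filter: O(N) span, since step k needs step k-1."""
    m, P = m0, P0
    means, covs = [], []
    for y in ys:
        # Predict: push the previous posterior through the dynamics.
        m = F @ m
        P = F @ P @ F.T + Q
        # Update: condition the prediction on the new measurement.
        S = H @ P @ H.T + R               # innovation covariance
        K = P @ H.T @ np.linalg.inv(S)    # Kalman gain
        m = m + K @ (y - H @ m)
        P = P - K @ S @ K.T
        means.append(m)
        covs.append(P)
    return np.array(means), np.array(covs)
```

The loop-carried dependence on `(m, P)` is exactly the temporal bottleneck that the associative-scan reformulation removes.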

2. Associative Scan Formulation for Temporal Parallelization

The core insight enabling parallel-in-time filtering is to recast the temporal recurrences as associative operations:

  • At each time step $k$, construct an "element" (e.g., $(A_k, b_k, C_k, \eta_k, J_k)$) encoding the update effect.
  • Define a binary operator $\otimes$ such that

$$s_k = a_1 \otimes a_2 \otimes \dots \otimes a_k$$

is associative and produces the filter mean/covariance at time $k$.

For the standard Kalman filter, the operator is derived such that composing these elements using parallel prefix-sum (scan) primitives yields the exact posterior statistics for all time points (Corenflos et al., 2021, Särkkä et al., 13 Nov 2025). This applies to both filtering and smoothing; for smoothing, reverse scans of appropriately constructed elements compute the backward information flows.

The key associative update for filtering is

$$
\begin{aligned}
A_{ij} &= A_j (I + C_i J_j)^{-1} A_i \\
b_{ij} &= A_j (I + C_i J_j)^{-1}(b_i + C_i \eta_j) + b_j \\
C_{ij} &= A_j (I + C_i J_j)^{-1} C_i A_j^\top + C_j \\
\eta_{ij} &= A_i^\top (I + J_j C_i)^{-1} (\eta_j - J_j b_i) + \eta_i \\
J_{ij} &= A_i^\top (I + J_j C_i)^{-1} J_j A_i + J_i
\end{aligned}
$$

where all matrix operations can be applied in parallel if the underlying hardware supports concurrent execution (Corenflos et al., 2021, Särkkä et al., 13 Nov 2025).
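The combination rule transcribes directly into NumPy. The element layout below (a 5-tuple of arrays) and the randomly generated test elements are illustrative choices; the numerical associativity check mirrors the property proved in general by Särkkä & García-Fernández.

```python
import numpy as np

def combine(e_i, e_j):
    """Combine filtering element e_i (earlier in time) with e_j (later)."""
    A_i, b_i, C_i, eta_i, J_i = e_i
    A_j, b_j, C_j, eta_j, J_j = e_j
    I = np.eye(A_i.shape[0])
    M = np.linalg.inv(I + C_i @ J_j)   # shared factor, forward (mean/cov) part
    N = np.linalg.inv(I + J_j @ C_i)   # shared factor, information part
    return (
        A_j @ M @ A_i,
        A_j @ M @ (b_i + C_i @ eta_j) + b_j,
        A_j @ M @ C_i @ A_j.T + C_j,
        A_i.T @ N @ (eta_j - J_j @ b_i) + eta_i,
        A_i.T @ N @ J_j @ A_i + J_i,
    )
```

Because `combine` is associative, it can be handed to any generic prefix-scan routine without further modification.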

3. Algorithmic Primitives, Complexity, and Implementation

Parallel prefix-sum algorithms implement the scan efficiently:

  • Blelloch scan: two-phase up-sweep and down-sweep producing $O(T)$ work and $O(\log T)$ span.
  • Ladner–Fischer scan: in-place tree circuit, $O(T)$ work and $O(\log T)$ span, minimal memory overhead.
  • Hillis–Steele scan: simpler, $O(T \log T)$ work and $O(\log T)$ span, generally suboptimal for large $T$.
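As a concrete reference point, the Hillis–Steele variant fits in a few lines of plain Python. The doubling loop simulates sequentially what each parallel pass would do, and it works for any associative operator, including the matrix-valued Kalman elements above.

```python
def hillis_steele_scan(op, xs):
    """Inclusive Hillis-Steele scan: O(T log T) work, O(log T) span.
    After the pass with offset d, slot i holds the combination of up to
    2d inputs ending at i; each pass is an embarrassingly parallel map."""
    xs = list(xs)
    d = 1
    while d < len(xs):
        # A parallel implementation updates every i concurrently; reading
        # from the old list here simulates that double buffering.
        xs = [xs[i] if i < d else op(xs[i - d], xs[i]) for i in range(len(xs))]
        d *= 2
    return xs
```

Note that the operator is applied as `op(earlier, later)`, so non-commutative operators such as the Kalman element combination are handled correctly.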

On GPUs and multicore systems, each time step $k$ is assigned to a thread or core. Matrix operations per element dominate the work, scaling as $O(n_x^3)$ for state dimension $n_x$. Two passes (filtering and smoothing) compute all moments in $O(\log N)$ parallel time, with $O(N)$ total work and $O(\log N)$ synchronizations (Särkkä et al., 13 Nov 2025).

GPU/CPU implementation involves:

  • Kernel for local element construction
  • Kernel(s) for scan operations, with synchronization after each tree merge layer
  • Optional backward scan for smoothing
  • Integration with computational-graph frameworks (e.g., TensorFlow, PyTorch) to enable automatic differentiation and learning (Corenflos et al., 2021)

4. Extensions and Applications

Parallel-in-time Kalman filtering has been adapted to:

  • Gaussian Process regression in state-space form: provides $O(\log N)$ inference for large temporal GPs. Experiments show >10× speedups vs. sequential approaches for $N \sim 10^4$ (Corenflos et al., 2021).
  • Probabilistic numerical ODE solvers: the iterated extended Kalman smoother (IEKS), adapted for prefix-scan, supports parallel filtering and smoothing for latent SDE models; speedups up to 100× observed for $N \ge 10^3$ (Bosch et al., 2023).
  • Slow-rate integrated measurement systems: SRTM models (integrated measurements over intervals) are restructured using associative operators; empirical GPU speedups of 8–10× for $N = 6000$ reported (Yaghoobi et al., 2024).
  • Block-partitioned, space-time parallel KFs: domain decomposition with overlap regularization enables parallel Kalman filtering across both time and spatial subdomains with quadratic scaling in domain count (D'Amore et al., 2023).
  • Orthogonal transformation-based parallel smoothers: QR and selective-inversion algorithms provide numerically stable, highly parallel smoothing, with up to 47× speedup over sequential RTS methods on 64-core ARM/Intel systems (Gargir et al., 17 Feb 2025).
  • Two-filter parallel smoothers: forward and backward filters via prefix scans on separate GPUs deliver up to 1.5× speedup over single-GPU smoothing (Särkkä et al., 13 Nov 2025).
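To make the first item concrete, the sketch below builds the discrete-time state-space form of a Matérn-3/2 GP prior, following the standard kernel-to-SDE correspondence; the lengthscale and variance values in the check are illustrative.

```python
import numpy as np

def matern32_ssm(lengthscale, sigma2, dt):
    """Discrete-time state-space form of a Matern-3/2 GP prior.
    The state is (f, f'); the GP value is read out by H = [1, 0]."""
    lam = np.sqrt(3.0) / lengthscale
    # Stationary covariance of the underlying SDE.
    Pinf = np.diag([sigma2, lam**2 * sigma2])
    # Transition matrix expm(F*dt) in closed form: F has a repeated
    # eigenvalue -lam, so exp(F t) = e^{-lam t} (I + (F + lam I) t).
    e = np.exp(-lam * dt)
    A = e * np.array([[1.0 + lam * dt, dt],
                      [-lam**2 * dt, 1.0 - lam * dt]])
    # Discrete process noise from stationarity: Pinf = A Pinf A^T + Q.
    Q = Pinf - A @ Pinf @ A.T
    H = np.array([[1.0, 0.0]])
    return A, Q, H, Pinf
```

With `A`, `Q`, `H`, and prior covariance `Pinf` in hand, GP regression on $N$ inputs reduces to Kalman filtering and smoothing over $N$ steps, which is exactly where the prefix-scan formulation applies.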

5. Numerical Stability and Limitations

Numerical stability varies by method:

  • Pure associative-scan methods, per Särkkä & García-Fernández, offer no direct backward-stability guarantees; rounding errors can accumulate across parallel merges, particularly for long chains or high state dimension $n_x$ (Gargir et al., 17 Feb 2025).
  • Householder QR-based parallel smoothers are conditionally backward-stable, with error bounded by $O(\epsilon)\,\|UA\|$, as in conventional QR methods. Selective inversion (SelInv) can be used to extract covariance blocks efficiently and stably (Gargir et al., 17 Feb 2025).
  • Square-root covariance formulations via QR ensure stability in probabilistic ODE solvers and GP regression (Bosch et al., 2023).
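The square-root idea in the last item amounts to one QR factorization per prediction step. The factor convention below ($P = S^\top S$ with `S` upper-triangular) is one common choice, used here for illustration.

```python
import numpy as np

def sqrt_predict(F, S, Qc):
    """Square-root covariance prediction. Given factors P = S.T @ S and
    Q = Qc.T @ Qc, returns upper-triangular R with
    R.T @ R = F @ P @ F.T + Q, without ever forming P explicitly.
    Working on factors keeps the covariance symmetric PSD in finite precision."""
    M = np.vstack([S @ F.T, Qc])
    # Only R from the QR factorization is needed: M.T @ M == R.T @ R.
    return np.linalg.qr(M, mode='r')
```

The update step admits an analogous factorization of the stacked measurement and prediction factors, so the whole filter can be run in square-root form.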

Limiting factors include:

  • Cost per time step grows as $O(n_x^3)$ due to matrix algebra.
  • For very large $n_x$ (e.g., $n_x \gtrsim 14$), memory and compute become bottlenecks.
  • Exactness applies only to linear-Gaussian state-space models; nonlinear latent states require further generalization (e.g., sigma-point or extended KF variants).
  • Prefix-scan methods require enough temporal steps ($N \gg$ available cores) to amortize kernel-launch and synchronization overhead (Särkkä et al., 13 Nov 2025).

6. Comparative Performance and Practical Recommendations

Empirical analysis reveals:

  • On GPUs, the Blelloch and Ladner–Fischer scans provide optimal work/span for $N \geq 10^5$; for smaller $N$, simpler scans (Hillis–Steele/Sengupta B) may have lower overhead.
  • Two-filter smoothers excel for very large $N$ if two GPUs are available. On a single GPU, standard PRTS scan-based smoothers are simpler and slightly faster.
  • Speedups of up to 500–750× (GPU) and 47× (multi-core CPU) are attainable for sufficiently large $N$ and moderate $n_x$ (Corenflos et al., 2021, Gargir et al., 17 Feb 2025, Särkkä et al., 13 Nov 2025). Memory-footprint considerations favor in-place scans for massive $N$.
| Algorithm | Complexity (work/span) | Speedup | Typical Hardware |
| --- | --- | --- | --- |
| Assoc. scan (PSSGP, PRTS) | $O(N)$ / $O(\log N)$ | 10–750× | GPU, multicore CPU |
| Odd–even QR (UltimateKF) | $O(N n_x^3)$ / $O(\log N\, n_x \log n_x)$ | 20–47× | Multicore CPU |
| DD-KF (space-time) | $O(N_{\text{sub}}^2)$ scaling | Near-quadratic in $N_{\text{sub}}$ | Distributed clusters |

7. Connections, Limitations, and Future Directions

Parallel-in-time filtering and smoothing represent a fundamental shift in temporal algorithm design for Bayesian state estimation. They subsume and outperform prior sparse-matrix and block-tridiagonal methods that lacked black-box associative scan capacity (Corenflos et al., 2021).

A plausible implication is that for appropriate models and hardware, parallel-in-time Kalman filtering and smoothing are poised to become the standard for high-throughput, high-dimensional, recursive Bayesian estimation in temporal applications. Limitations regarding model generality, numerical stability, and per-step computational cost remain areas of active development.
