Uncertainty Quantification Benchmark
- Uncertainty Quantification Benchmark is a standardized framework that defines and compares UQ techniques using realistic datasets and performance metrics.
- It evaluates methods such as aPC, adaptive sparse grids, kernel-based interpolation, and hybrid stochastic Galerkin to balance accuracy and computational efficiency.
- The benchmark employs practical test scenarios like CO₂ storage to rigorously assess uncertainty propagation and guide the selection of suitable UQ approaches.
Uncertainty quantification (UQ) benchmarks are standardized scenarios, datasets, or methodological frameworks used to rigorously compare, evaluate, and guide the selection of UQ techniques in computational science, engineering, and applied machine learning. These benchmarks provide reference solutions, well-defined performance metrics, and realistic problem formulations that reflect the typical sources, structures, and consequences of uncertainty in modeling and data-driven inference. In the context of subsurface flow, for example, UQ benchmarks are indispensable for assessing the predictive and computational characteristics of alternative non-intrusive and intrusive UQ methods, especially where the available observational data does not permit the construction of precise parametric probability distributions.
1. Key UQ Methods and Their Operational Principles
Uncertainty quantification in complex simulation scenarios relies on a range of surrogate modeling and stochastic discretization techniques, each with distinct theoretical foundations and computational trade-offs. The main types evaluated in benchmark studies of CO₂ storage (Köppel et al., 2018) include:
- Arbitrary Polynomial Chaos (aPC): Expands the model output as a sum of multivariate orthonormal polynomial basis functions constructed from statistical moments of the empirical input distribution. The representation takes the form
$$f(\mathbf{x}, t, \boldsymbol{\xi}) \approx \sum_{i=0}^{M} c_i(\mathbf{x}, t)\, \Phi_i(\boldsymbol{\xi}),$$
where the $\Phi_i$ are orthonormal with respect to the (moment-specified) input distribution and the $c_i$ are deterministic coefficients. Two non-intrusive computational strategies are common: probabilistic collocation (PCM) with a minimal number of samples for efficiency, and a full tensor grid combined with least-squares fitting to reduce oscillatory artifacts in higher dimensions. A minimal construction sketch is given after this list.
- Spatially Adaptive Sparse Grids: Constructs surrogates in high-dimensional parameter space using hierarchical local basis functions, with refinement targeted in regions of high local error (e.g., as indicated by weighted norms). Adaptive sparse grids counter exponential scaling in the number of samples (the "curse of dimensionality") and can place grid points both at boundaries and in the domain interior using linear extrapolation.
- Kernel-Based Greedy Interpolation: Builds sparse, data-driven surrogates by selecting a quasi-optimal set of “center” points (using, for instance, the power function as an informativeness criterion) and interpolating the output with compactly supported kernels such as the Wendland kernel. The approximation is $s_n(\boldsymbol{\xi}) = \sum_{j=1}^{n} \alpha_j\, K(\boldsymbol{\xi}, \boldsymbol{\xi}_j)$, with coefficients $\alpha_j$ fitted via interpolation and sample point selection driven by the Vectorial Kernel Orthogonal Greedy Algorithm (P-VKOGA).
- Hybrid Stochastic Galerkin (HSG): An intrusive strategy in which the governing PDEs (e.g., hyperbolic transport equations) are projected onto tailored polynomial chaos expansions within a partitioned stochastic domain (“multi-element” decomposition). The expansion is
$$u(\mathbf{x}, t, \boldsymbol{\xi}) \approx \sum_{l=1}^{N_e} \sum_{|\boldsymbol{\alpha}| \le p} u_{l,\boldsymbol{\alpha}}(\mathbf{x}, t)\, \Phi_{l,\boldsymbol{\alpha}}(\boldsymbol{\xi}),$$
where the $\Phi_{l,\boldsymbol{\alpha}}$ are locally supported polynomial bases on the stochastic elements and the solution coefficients $u_{l,\boldsymbol{\alpha}}$ are obtained by solving a coupled deterministic system over all elements.
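To make the moment-based aPC construction concrete, the following sketch orthonormalizes monomials against an empirical sample (classical Gram-Schmidt with the sample-mean inner product) and fits the expansion coefficients by least squares. It is only a one-dimensional illustration of the idea, not the benchmark implementation; the lognormal input, the training-set size, and the function name `my_model` are hypothetical.

```python
import numpy as np

def apc_basis(samples, degree):
    """Orthonormalize the monomials 1, x, ..., x^degree with respect to the
    empirical distribution of `samples`, using <f, g> = mean(f * g)."""
    V = np.vander(samples, degree + 1, increasing=True)   # (n_samples, degree+1)
    Q = np.empty_like(V, dtype=float)
    coeffs = np.zeros((degree + 1, degree + 1))           # monomial coefficients of each Phi_k
    for k in range(degree + 1):
        v = V[:, k].astype(float)
        c = np.zeros(degree + 1)
        c[k] = 1.0
        for j in range(k):
            proj = np.mean(V[:, k] * Q[:, j])             # <x^k, Phi_j>
            v = v - proj * Q[:, j]
            c = c - proj * coeffs[j]
        norm = np.sqrt(np.mean(v * v))
        Q[:, k] = v / norm
        coeffs[k] = c / norm
    return coeffs                                          # Phi_k(x) = sum_m coeffs[k, m] * x**m

def fit_apc(xi_train, y_train, coeffs):
    """Least-squares fit of c_i in  y(xi) ≈ sum_i c_i * Phi_i(xi)."""
    degree = coeffs.shape[0] - 1
    Phi = np.vander(xi_train, degree + 1, increasing=True) @ coeffs.T
    c, *_ = np.linalg.lstsq(Phi, y_train, rcond=None)
    return c

# Example (hypothetical model `my_model`):
# rng = np.random.default_rng(0)
# xi = rng.lognormal(size=2000)                    # non-Gaussian, moment-specified input
# coeffs = apc_basis(xi, degree=3)
# c = fit_apc(xi[:50], my_model(xi[:50]), coeffs)
# mean, var = c[0], np.sum(c[1:] ** 2)             # moments read off via orthonormality
```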
2. Structure of the Benchmark Scenario and Governing Formulation
A prototypical UQ benchmark for geoscience applications is constructed using a simplified physical model with problem parameters derived from credible site data. In the CO₂ storage benchmark (Köppel et al., 2018):
- Physical setting: The injection of CO₂ into a saline aquifer is governed by the nonlinear, capillarity-free fractional flow formulation for two incompressible phases, reduced to a radial, one-dimensional representation near the well.
- Pressure equation: The pressure profile, which admits a deterministic solution, satisfies a radial Darcy-type equation of the form
$$-\frac{1}{r}\,\frac{\partial}{\partial r}\!\left(r\, K\, \lambda_{\mathrm{tot}}(S)\, \frac{\partial p}{\partial r}\right) = 0,$$
where $K$ is the permeability, $\lambda_{\mathrm{tot}}$ is the total mobility, and the boundary flux depends on the uncertain boundary conditions (the injection rate).
- Transport equation: The saturation is propagated using a central-upwind finite volume method adapted to the radial coordinate (a simplified upwind sketch follows this list).
- Input data: Model parameters are physically plausible (e.g., drawn from site databases), with sufficiently fine spatial and temporal discretization (e.g., 250 cells) and a large reference ensemble (10,000 Monte Carlo samples) to yield converged moment estimates.
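The following sketch is a heavily simplified stand-in for the benchmark's forward model: an explicit first-order upwind finite-volume discretization of the radial fractional-flow transport equation. The benchmark itself uses a central-upwind scheme with calibrated site parameters, so all numerical values, the quadratic relative-permeability choice, and the function name `co2_saturation` here are illustrative assumptions only.

```python
import numpy as np

def co2_saturation(q, phi, mob_ratio,
                   r_w=0.2, R=10.0, b=10.0, n_cells=100, t_end=2.0e4):
    """Toy radial two-phase transport (placeholder values, not benchmark data):
       phi * dS/dt + (1/r) * d/dr( r * u(r) * f(S) ) = 0,  with  r * u(r) = q / (2*pi*b)."""
    r_edges = np.linspace(r_w, R, n_cells + 1)
    r = 0.5 * (r_edges[:-1] + r_edges[1:])            # cell centres
    dr = r_edges[1] - r_edges[0]
    S = np.zeros(n_cells)                             # initially brine-saturated
    flux_const = q / (2.0 * np.pi * b)                # r * u(r), constant in r

    def f(s):                                         # fractional flow of CO2
        return mob_ratio * s**2 / (mob_ratio * s**2 + (1.0 - s)**2)

    # crude CFL estimate from the maximum wave speed near the well
    s_grid = np.linspace(0.0, 1.0, 401)
    fp_max = np.max(np.gradient(f(s_grid), s_grid))
    dt = 0.4 * dr * phi * r_w / (flux_const * fp_max)
    n_steps = int(np.ceil(t_end / dt))
    dt = t_end / n_steps

    for _ in range(n_steps):
        F = np.empty(n_cells + 1)                     # upwinded edge fluxes r*u*f(S)
        F[0] = flux_const * f(1.0)                    # injected CO2: S = 1 at the well
        F[1:] = flux_const * f(S)                     # outward flow, so upwind from the left
        S = S - dt / (phi * r * dr) * (F[1:] - F[:-1])
        np.clip(S, 0.0, 1.0, out=S)
    return r, S

# r, S = co2_saturation(q=0.01, phi=0.2, mob_ratio=2.0)   # saturation profile at t_end
```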
3. Sources and Modeling of Uncertainty
The benchmark deliberately incorporates multiple realistic sources of parametric uncertainty, each encoded as an independent random variable:
Source | Mathematical Representation | Impact on Model
---|---|---
Boundary conditions | Injection rate treated as an independent random variable entering the boundary condition | Variable injection rate
Conceptual model | Random parameter in the relative-permeability nonlinearity | Uncertain fractional flow and front behavior
Material properties | Porosity treated as an independent random variable | Reservoir porosity variability
All uncertain parameters are propagated via their empirically estimated distributions to ensure realism, with the full reference solution constructed from the ensemble of Monte Carlo samples.
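As a sketch of how such an ensemble reference could be assembled, the snippet below draws independent samples for the three uncertainty sources and propagates them through the toy forward model sketched in Section 2. The uniform ranges and the small sample count are placeholders; the benchmark uses empirically estimated distributions and 10,000 samples.

```python
import numpy as np

rng = np.random.default_rng(42)
n_mc = 200                                   # illustrative; the reference uses 10,000

# Placeholder input distributions for the three uncertainty sources
q_s   = rng.uniform(0.005, 0.015, n_mc)      # boundary condition: injection rate [m^3/s]
mob_s = rng.uniform(1.5, 3.0, n_mc)          # conceptual model: mobility ratio [-]
phi_s = rng.uniform(0.15, 0.25, n_mc)        # material property: porosity [-]

profiles = []
for q_i, phi_i, mob_i in zip(q_s, phi_s, mob_s):
    r, S = co2_saturation(q_i, phi_i, mob_i)     # toy model from the Section 2 sketch
    profiles.append(S)
profiles = np.asarray(profiles)                  # shape (n_mc, n_cells)

# Monte Carlo reference moments of the saturation field at t_end
S_mean_ref = profiles.mean(axis=0)
S_std_ref  = profiles.std(axis=0, ddof=1)
```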
4. Performance Metrics and Comparative Criteria
Accurate benchmarking necessitates rigorous, interpretable summary metrics:
- Expectation (Mean): The mean CO₂ saturation, as a function of space and time, is computed by each UQ method and compared to the Monte Carlo reference.
- Standard Deviation: The second central moment is quantified and reported as the standard deviation, revealing the predicted spread of saturation as a function of space and time.
- Convergence Analysis: For surrogate models, error decay is plotted as a function of the number of full model runs (or grid resolution); for HSG, accuracy is tracked vs. polynomial order and element count.
- Efficiency and Scalability: The computational burden, measured in the number of full model evaluations and the cost of reconstructing predictions and uncertainties, is compared across approaches, yielding practical guidance for modelers (a minimal moment-error sketch follows this list).
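The sketch below shows how such moment errors could be tabulated against the Monte Carlo reference from Section 3; `build_surrogate` and `xi_test` are hypothetical placeholders for a surrogate constructor and a test design, not names from the benchmark.

```python
import numpy as np

def moment_errors(surrogate_eval, xi_samples, S_mean_ref, S_std_ref):
    """Relative L2 errors of surrogate-predicted mean and standard deviation
    against the Monte Carlo reference moments."""
    preds = np.asarray([surrogate_eval(xi) for xi in xi_samples])
    err_mean = np.linalg.norm(preds.mean(axis=0) - S_mean_ref) / np.linalg.norm(S_mean_ref)
    err_std  = np.linalg.norm(preds.std(axis=0, ddof=1) - S_std_ref) / np.linalg.norm(S_std_ref)
    return err_mean, err_std

# Convergence study: error decay versus the number of full model runs
# for n_runs in (8, 16, 32, 64, 128):
#     surrogate = build_surrogate(n_runs)              # hypothetical constructor
#     print(n_runs, *moment_errors(surrogate, xi_test, S_mean_ref, S_std_ref))
```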
5. Advantages, Disadvantages, and Practical Implementation Guidance
The benchmark exposes the trade-offs inherent in each method:
Method | Advantages | Disadvantages |
---|---|---|
aPC | Efficient for low-order, low-dimension | Prone to oscillations, global basis sensitivity |
Sparse grids | Adaptivity, high-dimension scalability | Complexity in grid refinement, boundary point placement |
Kernel greedy | Sparse, fast surrogate; quasi-optimal convergence | Requires a minimum number of samples; convergence can be slow
Hybrid Galerkin | Full intrusive statistics, postprocessing flexibility | Solver modification, curse of dimensionality |
A key conclusion is that low-order aPC (or low-resolution HSG) may suffice for rough moment estimates, but accurate resolution of the standard deviation, especially near discontinuities, favors adaptive approaches such as sparse grids or kernel-based interpolation. For high-dimensional or highly nonlinear problems, surrogate models with adaptive refinement are generally preferable; a one-dimensional sketch of surplus-driven adaptive refinement is given below.
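The sketch below illustrates the refinement principle behind spatially adaptive sparse grids in one dimension with hierarchical hat functions, where the hierarchical surplus serves as the local error indicator. The benchmark's method extends this to several dimensions and to boundary treatment, so this is only the core mechanism, not the actual implementation.

```python
import numpy as np

def hat(level, index, x):
    """Hierarchical hat function centred at index * 2**(-level) on [0, 1]."""
    h = 2.0 ** (-level)
    return np.maximum(0.0, 1.0 - np.abs(x - index * h) / h)

def adaptive_interpolate(f, tol=1e-3, max_level=10):
    """Surplus-driven adaptive hierarchical interpolation of f on [0, 1]."""
    nodes = {}                                   # (level, index) -> hierarchical surplus
    queue = [(1, 1)]                             # start from the midpoint x = 0.5
    while queue:
        level, index = queue.pop()
        x = index * 2.0 ** (-level)
        surplus = f(x) - sum(a * hat(l, i, x) for (l, i), a in nodes.items())
        nodes[(level, index)] = surplus
        # refine only where the local surplus signals a large error
        if abs(surplus) > tol and level < max_level:
            queue += [(level + 1, 2 * index - 1), (level + 1, 2 * index + 1)]

    def interpolant(x):
        return sum(a * hat(l, i, x) for (l, i), a in nodes.items())
    return interpolant, nodes

# Example on a sharp, front-like profile: points cluster near the steep region.
# s_tilde, nodes = adaptive_interpolate(lambda x: np.exp(-200.0 * (x - 0.4) ** 2))
# print(len(nodes), "adaptively placed grid points")
```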
6. Analytical Formulation and Key Mathematical Expressions
The benchmark formalizes each UQ method with explicit expansions:
- aPC expansion: $f(\mathbf{x}, t, \boldsymbol{\xi}) \approx \sum_{i=0}^{M} c_i(\mathbf{x}, t)\, \Phi_i(\boldsymbol{\xi})$, with $\Phi_i$ orthonormal with respect to the empirical input distribution.
- Sparse grid surrogate: $\tilde{f}(\boldsymbol{\xi}) = \sum_{\mathbf{l}, \mathbf{i}} \alpha_{\mathbf{l}, \mathbf{i}}\, \varphi_{\mathbf{l}, \mathbf{i}}(\boldsymbol{\xi})$, with hierarchical basis functions $\varphi_{\mathbf{l}, \mathbf{i}}$ and hierarchical surpluses $\alpha_{\mathbf{l}, \mathbf{i}}$.
- Kernel interpolant: $s_n(\boldsymbol{\xi}) = \sum_{j=1}^{n} \alpha_j\, K(\boldsymbol{\xi}, \boldsymbol{\xi}_j)$, with a compactly supported kernel $K$ (e.g., Wendland) and greedily selected centers $\boldsymbol{\xi}_j$.
- Error bound for kernel interpolation: $|f(\boldsymbol{\xi}) - s_n(\boldsymbol{\xi})| \le P_n(\boldsymbol{\xi})\, \|f\|_{\mathcal{H}_K}$, where $P_n$ is the power function and $\mathcal{H}_K$ the native space of $K$.
- HSG expansion: $u(\mathbf{x}, t, \boldsymbol{\xi}) \approx \sum_{l=1}^{N_e} \sum_{|\boldsymbol{\alpha}| \le p} u_{l,\boldsymbol{\alpha}}(\mathbf{x}, t)\, \Phi_{l,\boldsymbol{\alpha}}(\boldsymbol{\xi})$, with $\Phi_{l,\boldsymbol{\alpha}}$ supported only on stochastic element $l$.
These explicit representations are critical for implementation and benchmarking, as they directly govern both model accuracy and computational requirements. A sketch of the power-function-driven greedy selection used in the kernel approach is given below.
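The sketch below implements a scalar power-function-greedy (P-greedy) centre selection with a Wendland kernel, making the kernel interpolant and its power-function error indicator concrete. The actual P-VKOGA handles vector-valued outputs and further refinements, so this is only the underlying mechanism, and the function names are illustrative.

```python
import numpy as np

def wendland_c2(r):
    """Compactly supported Wendland C^2 radial kernel profile (support radius 1)."""
    r = np.abs(r)
    return np.where(r < 1.0, (1.0 - r)**4 * (4.0 * r + 1.0), 0.0)

def p_greedy_interpolant(X, y, n_centers, shape=1.0):
    """Select centres by maximising the power function, then interpolate y there.
    X: (n, d) candidate parameter points, y: (n,) model outputs at those points."""
    def K(A, B):
        d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
        return wendland_c2(shape * d)

    k_diag = np.diag(K(X, X))                       # K(x, x) for every candidate
    idx = [0]                                       # arbitrary first centre
    for _ in range(n_centers - 1):
        Kc  = K(X, X[idx])                          # (n, m) cross-kernel to current centres
        Kcc = K(X[idx], X[idx])                     # (m, m) centre kernel matrix
        sol = np.linalg.solve(Kcc, Kc.T)            # (m, n)
        # squared power function: K(x,x) - k(x,C)^T Kcc^{-1} k(x,C)
        p2 = k_diag - np.einsum('nm,mn->n', Kc, sol)
        p2[idx] = -np.inf                           # never reselect an existing centre
        idx.append(int(np.argmax(p2)))

    centers = X[idx]
    alpha = np.linalg.solve(K(centers, centers), y[idx])
    return lambda x: K(np.atleast_2d(x), centers) @ alpha, centers
```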
7. Recommendations for Modelers and Benchmarking Best Practices
The benchmark paper yields the following guidance:
- Select aPC (with PCM) or low-resolution HSG for simple, low-cost UQ in low dimensions and when only means are needed.
- Use adaptive sparse grids or kernel-based greedy surrogates when accurate uncertainty quantification (including second moments and local features) is required or when faced with higher-dimensional parameter spaces.
- Use intrusive (Galerkin-type) methods when full probabilistic reformulation is needed and postprocessing flexibility is a priority, but be aware of significant code modifications and increased computational burden.
- Match the computational budget to the demands of the required output accuracy and consider the presence of discontinuities or sharp fronts, as global basis expansions face accuracy breakdown in such cases.
- For CO₂ storage and analogous settings with limited data for parameter distribution estimation, prefer UQ methods that are robust to distributional misspecification and efficiently capture empirical uncertainty propagation.
In summary, rigorous UQ requires benchmarks that mimic true operational uncertainty and allow for exhaustive comparison of methods on accuracy, efficiency, and scalability. The CO₂ storage benchmark (Köppel et al., 2018) exemplifies such a standard by combining physically motivated parametrizations, multiple uncertainty sources, and a framework that allows head-to-head assessment of both intrusive and non-intrusive UQ methodologies.