
Noise-Aware Surrogate Modeling

Updated 5 July 2025
  • Noise-aware surrogate modeling is a set of techniques that explicitly accounts for noise from measurement errors, numerical approximations, and model misspecification in predictive tasks.
  • Its methodologies include error modeling, feature-based regression, clustering, and Gaussian process approaches to enhance reliability in noisy environments.
  • These techniques improve decision-making in engineering and science by robustly quantifying uncertainty and correcting surrogate model outputs.

Noise-aware surrogate modeling encompasses a broad class of strategies for constructing surrogate models that explicitly account for the presence and influence of noise—whether originating from measurement errors, numerical approximations, stochastic simulation outputs, or model misspecification—in the process of learning, inference, or uncertainty quantification. Modern noise-aware methods move beyond traditional, noise-agnostic surrogates by quantifying, propagating, and often correcting for uncertainty, enabling more reliable predictions and robust decision-making in data-driven and computational science applications.

1. Foundational Principles and Motivation

Surrogate modeling aims to approximate expensive or inaccessible physical, computational, or conceptual models using tractable (often statistical or machine learning-based) tools. In practice, the data available for surrogate construction, and often for its eventual use, are typically contaminated by noise. This noise can be:

  • Aleatory: inherent stochasticity in the physical process or simulator ($\epsilon_\text{A}$)
  • Epistemic: uncertainty in the model parameters or structure, often due to limited or biased training data
  • Approximation error: mismatch between the surrogate and its target, which is particularly acute for imperfect or low-fidelity surrogates

Accounting for these noise sources is essential for trustworthy surrogate construction, error prediction, reliability analysis, and uncertainty propagation, especially in engineering and scientific domains where decision processes are sensitive to the tails of the distribution or out-of-sample behavior (1701.03240, 2312.05153, 2401.06447, 2412.11875).
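
These sources combine via the law of total variance; writing $\theta$ for the surrogate's parameters (notation chosen here for exposition, not drawn from any single cited paper):

$$\operatorname{Var}[y \mid x] = \underbrace{\mathbb{E}_{\theta}\big[\operatorname{Var}[y \mid x, \theta]\big]}_{\text{aleatory}} + \underbrace{\operatorname{Var}_{\theta}\big[\mathbb{E}[y \mid x, \theta]\big]}_{\text{epistemic}},$$

with approximation error entering as bias whenever the surrogate family cannot represent its target.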

2. Error Modeling and Feature-based Approaches

A central strategy for noise-aware surrogate modeling is the explicit prediction of surrogate error, leveraging auxiliary information or “error indicators” produced by the surrogate itself. One influential framework (1701.03240) entails:

  • Feature construction: Generate a high-dimensional set of surrogate-produced error indicators, including current and historical state variables, operator information (e.g., Jacobians), parameter differences, and global metrics like pore volume injected or timestep size.
  • Regression modeling: Model the error $\delta^n \equiv q^n_\text{HFM} - q^n_\text{surrogate}$ as $\delta^n = q(\phi^n) + \epsilon^n$, where $\phi^n$ is the feature vector and $\epsilon^n$ is residual noise.
  • Locality via clustering/classification: Partition the feature space into clusters or regimes—determined either through supervised (classification) or unsupervised (clustering) methods—leading to local (regime-specific) regressors, improving fit in heterogeneous or nonstationary regimes.
  • Direct correction and uncertainty quantification: Use the trained regression to correct surrogate outputs and provide statistical error estimates (for example, via integration into the ROMES methodology).

This approach is particularly notable for large-scale, time-dependent, or parameter-dependent systems, such as reduced-order models for subsurface flow, and demonstrates improved accuracy and uncertainty characterization over naive surrogates (1701.03240).
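
A minimal sketch of this cluster-then-regress error model is given below; the error indicators, synthetic data, and the choice of KMeans plus ridge regression are illustrative assumptions, not the specific features or regressors of (1701.03240).

```python
# Sketch of regime-local error regression: cluster the feature space of
# surrogate-produced error indicators, then fit one regressor per regime.
# Data, features, and model choices are illustrative placeholders.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# phi: error indicators (e.g., residual norms, timestep size);
# delta: observed surrogate error q_HFM - q_surrogate.
phi = rng.normal(size=(500, 4))
delta = phi @ np.array([0.5, -1.0, 0.2, 0.0]) + 0.1 * rng.normal(size=500)

# 1) Partition the feature space into regimes (unsupervised clustering).
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(phi)

# 2) Fit a local (regime-specific) regressor in each cluster.
local = {k: Ridge(alpha=1.0).fit(phi[km.labels_ == k], delta[km.labels_ == k])
         for k in range(3)}

# 3) Predict the error for new indicators and correct the surrogate output.
def predict_error(phi_new):
    labels = km.predict(phi_new)
    return np.array([local[k].predict(p[None, :])[0]
                     for k, p in zip(labels, phi_new)])

phi_test = rng.normal(size=(5, 4))
q_surrogate = rng.normal(size=5)            # placeholder surrogate outputs
q_corrected = q_surrogate + predict_error(phi_test)
```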

3. Random Process-based and Gaussian Process Surrogates

Gaussian process (GP) regression forms one of the principal noise-aware surrogate modeling paradigms (1707.03916, 2109.05324, 2401.06447). In these methods:

  • Observation modeling: Outputs are assumed to be noisy realizations of a latent function: $y(x) = f(x) + \epsilon$, with $\epsilon$ typically modeled as (possibly input-dependent) Gaussian noise (a minimal noise-aware fit is sketched after this list).
  • Uncertainty quantification: The posterior GP delivers both a predictive mean and variance, the latter reflecting the influence of data noise (manifesting as a “nugget” term in the kernel) and epistemic uncertainty due to limited data.
  • Variable-fidelity modeling: Extension to multi-fidelity data is achieved via co-kriging, in which low- and high-fidelity sources are fused using models such as $y_H(x) = \rho\, y_L(x) + y_D(x)$, with $y_D(x)$ and $y_L(x)$ modeled as independent GPs, each with its own noise model (1707.03916, 2401.06447).
  • Scalability and replication handling: For large datasets or those with substantial replication (a hallmark of stochastic simulation), scalable local GP variants such as LAGP and inducing-point methods (LIGP) use mathematical identities (Woodbury) to enable local, replication- and noise-aware approximation without requiring expensive global kernel inversion (2109.05324).
  • Active learning and reliability: Noise-aware learning functions, particularly in reliability analysis, are designed to account for inherent noise by explicitly incorporating the irreducible prediction variance when guiding the selection of new data points, ensuring sampling focuses on informative—rather than merely noisy—regions (2401.10796).
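
The observation model above can be made concrete with a short scikit-learn example; the synthetic data and the RBF-plus-WhiteKernel choice are assumptions for illustration, with the WhiteKernel term playing the role of the nugget.

```python
# Sketch: noise-aware GP regression, y(x) = f(x) + eps, with the noise
# variance learned as a WhiteKernel ("nugget") term. Synthetic data;
# kernel choice and initial hyperparameters are illustrative assumptions.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(60, 1))
y = np.sin(X).ravel() + 0.2 * rng.normal(size=60)  # noisy observations

# RBF captures the latent f; WhiteKernel absorbs observation noise.
kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)

X_test = np.linspace(0, 10, 200)[:, None]
mean, std = gp.predict(X_test, return_std=True)  # predictive mean and std

# The fitted noise_level estimates the aleatoric variance; the predictive
# std grows away from the training data, reflecting epistemic uncertainty.
print(gp.kernel_)
```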

4. Deep, Generative, and Hybrid Surrogate Models

Recent advances have enabled surrogates that are not only point predictors but also generative models for the full conditional distribution of outputs given inputs:

  • Latent-variable and adversarial frameworks: Deep surrogate models, such as conditional variational autoencoders and adversarially trained models, represent the relationship $p(y \mid x)$ via latent variables $z$ and optimize an evidence lower bound (ELBO) or adversarial objectives to capture both aleatoric and epistemic uncertainties (1901.04878). These approaches naturally absorb non-Gaussian, heteroscedastic, or non-additive noise structures.
  • Noise contrastive priors for neural nets: Bayesian neural networks with noise contrastive priors (BNN-NCP) further regularize epistemic uncertainty by injecting synthetic out-of-distribution samples and penalizing overconfident predictions far from the training distribution, enforcing realistic uncertainty estimates (2402.00760).
  • Pseudo-reversible normalizing flows: Surrogate models based on conditional pseudo-reversible normalizing flows learn the full conditional output distribution $p(y \mid x)$ directly from noisy data, without requiring prior knowledge of the noise model. Such architectures leverage neural networks with “soft” invertibility constraints and change-of-variable formulae to learn expressive, invertible mappings, and enable theoretical convergence analysis in Kullback–Leibler divergence (2404.00502).
  • Hybrid surrogates and multi-source integration: Bayesian frameworks for combining simulation and measurement data employ either a convex combination of independently trained surrogates' posterior predictive distributions (posterior predictive weighting; see the sketch after this list) or joint likelihood-based training with power-scaling parameters to reflect the relative trust in simulation versus real data (2412.11875).
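
A minimal numpy sketch of the first hybrid strategy, posterior predictive weighting: two independently trained surrogates' predictive distributions are blended as a convex mixture. The Gaussian predictive form and the fixed weight w are illustrative assumptions, not the specific construction of (2412.11875).

```python
# Sketch: convex combination of two surrogates' posterior predictive
# distributions (simulation-trained vs. measurement-trained). Gaussian
# predictives and the fixed weight w are illustrative assumptions.
import numpy as np

def mixture_predictive(mu_sim, var_sim, mu_meas, var_meas, w):
    """Mean/variance of w*N(mu_sim, var_sim) + (1-w)*N(mu_meas, var_meas)."""
    mean = w * mu_sim + (1 - w) * mu_meas
    # Law of total variance for a two-component mixture.
    var = (w * var_sim + (1 - w) * var_meas
           + w * (mu_sim - mean) ** 2 + (1 - w) * (mu_meas - mean) ** 2)
    return mean, var

# Example: place more trust in measurements (w = 0.3 on the simulation side).
mean, var = mixture_predictive(mu_sim=1.2, var_sim=0.04,
                               mu_meas=1.0, var_meas=0.09, w=0.3)
```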

5. Quantitative Uncertainty Propagation and Inference

A central aim in noise-aware surrogate modeling is to propagate uncertainty, not only in prediction, but also into downstream inference and decision-making:

  • Epistemic and aleatoric separation: Frameworks explicitly distinguish between uncertainty due to finite data (epistemic, e.g., the posterior over surrogate parameters) and irreducible model error (aleatoric, e.g., noise in simulation or measurement). Bayesian inference methods integrate over the full posterior of surrogate parameters to ensure that all uncertainty sources are reflected in prediction and inference (2312.05153).
  • Confidence and prediction intervals: In multi-fidelity surrogate schemes, confidence intervals (CI) quantify uncertainty in estimating the noise-free model, while prediction intervals (PI) additionally account for the irreducible observation noise, e.g., $P[\psi_{\text{lo}}(x) < \psi_H(x) < \psi_{\text{up}}(x)] = 1-2\alpha$ for a CI and $P[y_{\text{lo}}(x,\epsilon_H) < y_H(x,\epsilon_H) < y_{\text{up}}(x,\epsilon_H)] = 1-2\alpha$ for a PI (2401.06447).
  • Sobol indices and sensitivity analysis: For surrogates based on polynomial chaos expansions (PCE), the analytic structure allows direct computation of global sensitivity indices that separate contributions from parametric uncertainty and intrinsic noise, illuminating which sources most drive model output variance (2311.00553); a minimal sketch follows this list.
  • Calibration and validation: Simulation-based calibration, simulation studies, and diagnostic plots (e.g., coverage, expected log density) assess whether propagated uncertainties align with empirical error rates, determining whether a surrogate is well-calibrated or remains overconfident beneath prevailing noise (2312.05153, 2402.01810).
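
For the PCE case, these Sobol indices are available in closed form from the expansion coefficients. A minimal sketch assuming an orthonormal basis follows; the multi-indices and coefficients are illustrative placeholders, not taken from (2311.00553).

```python
# Sketch: first-order and total Sobol indices read directly off PCE
# coefficients under an orthonormal basis. With f = sum_a c_a * Psi_a,
# Var[f] = sum_{a != 0} c_a^2, and each index is a ratio of coefficient
# energies. Multi-indices and coefficients are illustrative placeholders.
import numpy as np

# Multi-indices alpha (rows) over d = 2 inputs, and coefficients c_alpha;
# the first row is the constant (alpha = 0) term.
alphas = np.array([[0, 0], [1, 0], [0, 1], [2, 0], [1, 1]])
coeffs = np.array([1.0, 0.8, 0.3, 0.1, 0.05])

var_total = np.sum(coeffs[1:] ** 2)  # alpha = 0 carries no variance

d = alphas.shape[1]
S_first = np.zeros(d)   # variance from terms involving input i alone
S_total = np.zeros(d)   # variance from all terms involving input i
for i in range(d):
    involves_i = alphas[:, i] > 0
    only_i = involves_i & (alphas[:, np.arange(d) != i] == 0).all(axis=1)
    S_first[i] = np.sum(coeffs[only_i] ** 2) / var_total
    S_total[i] = np.sum(coeffs[involves_i] ** 2) / var_total
```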

6. Methodological Implications and Practical Applications

Noise-aware surrogate modeling is applied and benchmarked in a range of challenging contexts:

  • Engineering and scientific simulations: Models for oil–water subsurface flow (1701.03240), wind turbine blade loading (2401.06447), vibroacoustics of musical instruments (1802.10487), and large-scale option pricing (2109.05324) exemplify how noise handling and uncertainty quantification translate into practical accuracy gains and computational efficiency.
  • Data assimilation and reliability analysis: Integration into Bayesian inference workflows allows measurement noise and model discrepancy to be reflected in posterior parameter distributions, while denoising surrogates enable reliable failure probability estimates even for noisy (or inherently stochastic) limit-state functions (2401.10796, 2312.05153).
  • Hybrid modeling and model diagnosis: Techniques for merging simulation and measurement data enable diagnosis of model misspecification and assignment of predictive trust according to empirical fit versus physical knowledge, enhancing robustness and suggesting directions for future model refinement (2412.11875).
  • Generative simulation and uncertainty propagation: Surrogates that admit sampling from conditional output distributions support forward and inverse uncertainty quantification (e.g., propagating uncertainty through a chain of model steps, or reconstructing input posteriors from measurements) (1901.04878, 2404.00502).

7. Limitations and Frontiers

Contemporary noise-aware surrogate modeling frameworks face several practical and theoretical challenges:

  • Curse of dimensionality: High-dimensional parameter spaces or outputs, while increasingly tractable thanks to advances in deep generative surrogates and dimensionality reduction, can still preclude credible uncertainty quantification or demand substantial training data (1901.04878, 2404.00502).
  • Model misspecification and low-noise regimes: For nearly deterministic models, standard Bayesian or loss-minimizing approaches may severely underestimate predictive uncertainty. Recent advances highlight the need for occupancy- or ensemble-based posterior representations to ensure the generalization error remains finite (2402.01810).
  • Computational complexity: Probabilistic surrogates (e.g., latent GP, deep BNN) can incur substantial cost in posterior inference or sampling, prompting development of scalable local approximations, efficient sampling strategies, and computationally efficient uncertainty estimation (2109.05324, 2402.00760).
  • Integration of multiple sources and trust assignment: Determining appropriate trust/weighting for hybrid surrogates remains an active area, with ongoing research focused on adaptive power-scaling, stacking, and optimal transport metrics for blending predictions (2412.11875).

Overall, noise-aware surrogate modeling is a mature yet rapidly evolving set of techniques, responding to the dual demands of computational tractability and rigorous uncertainty quantification in noisy, complex modeling regimes across science and engineering.