Decision-Aware Evaluation of Physics-Informed Surrogates

Published 5 Jun 2026 in cs.LG and cs.CE | (2606.07146v1)

Abstract: Physics-informed machine learning is often assessed by curve error, although engineering use depends on downstream decisions: ranking candidates, avoiding infeasible designs and limiting regret. We introduce pinn-gym, an open benchmark for material-conditioned lattice design that couples a transparent reduced-order crush-and-impact oracle with five printable polymer cards, dimensionless force-response targets and a protocol spanning curve fidelity, physical admissibility, top-k retrieval and mass regret. Across per-material, pooled and cross-material settings, low nRMSE is frequently insufficient to identify useful design selections. Physics-informed losses alter trade-offs rather than monotonically improving all metrics, and dimensionless conditioning improves comparability without making transfer symmetric. The benchmark is not a certified material model; within the released oracle, candidate generator and material cards, pinn-gym provides a reproducible testbed for evaluating PIML surrogates as decision systems rather than curve predictors alone.


Authors (2)

Summary

The paper introduces pinn-gym, a benchmark that shifts evaluation from traditional curve-fitting to decision-aware metrics for material-conditioned design.
It reveals that lower curve-error metrics do not guarantee superior design utility, as evidenced by discrepancies in feasible candidate retrieval and constraint adherence.
It demonstrates how physics-informed loss terms and asymmetric cross-material transfer impact surrogate performance, urging tailored evaluation strategies.

Decision-Aware Evaluation of Physics-Informed Surrogates: A Technical Analysis

Introduction and Motivation

Physics-informed machine learning (PIML) models are widely used to accelerate simulation-driven design workflows, especially where evaluating candidate structures by high-fidelity simulation or experiment is costly. Standard evaluation, however, typically relies on curve-fitting error (e.g., nRMSE) and neglects the task the surrogates ultimately serve: selecting, ranking, and filtering design candidates under physical constraints. "Decision-Aware Evaluation of Physics-Informed Surrogates" (2606.07146) addresses this gap by introducing pinn-gym, a transparent, reproducible benchmark for evaluating PIML surrogates on material-conditioned design and ranking of energy-absorbing polymer lattices.

Figure 1: Overview of the pinn-gym benchmark, visualizing the material-aware sampling, oracle labeling, surrogate modeling pipeline, and the joint evaluation metrics over candidate design pools.

Benchmark Structure and Design Protocol

The pinn-gym benchmark centers on a reduced-order crush-and-impact "oracle" that evaluates lattice geometries across five printable polymer "cards," producing dimensionless force-displacement responses ( $\hat{f}(\epsilon)$ ), absorbed energies, peak forces, and binary feasibility outcomes. The focus is on selecting lightweight, constraint-satisfying designs for a fixed impact scenario and fixture. Material-specific input vectors parameterize the surrogates, so that both per-material specialists and generalized, material-conditioned models can be evaluated in per-material, pooled, and cross-material settings.
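
To make the oracle outputs named above concrete, the following minimal sketch derives an absorbed energy, a peak force, and a binary feasibility flag from a sampled dimensionless curve. It is not the released oracle: the function names, the trapezoidal integration, the thresholds `energy_min` and `peak_max`, and the feasibility rule itself are assumptions made for illustration.

```python
from typing import Sequence


def absorbed_energy(strain: Sequence[float], force: Sequence[float]) -> float:
    """Trapezoidal integral of force over strain (a dimensionless energy)."""
    if len(strain) != len(force) or len(strain) < 2:
        raise ValueError("strain and force need equal length of at least 2")
    return sum(
        0.5 * (force[i] + force[i + 1]) * (strain[i + 1] - strain[i])
        for i in range(len(strain) - 1)
    )


def peak_force(force: Sequence[float]) -> float:
    """Largest sampled force value on the curve."""
    return max(force)


def is_feasible(
    strain: Sequence[float],
    force: Sequence[float],
    energy_min: float,
    peak_max: float,
) -> bool:
    # Assumed rule (hypothetical): a design is feasible when it absorbs at
    # least energy_min without its peak force exceeding peak_max.
    return (
        absorbed_energy(strain, force) >= energy_min
        and peak_force(force) <= peak_max
    )


# Synthetic curve, not taken from the benchmark.
strain = [0.0, 0.1, 0.2, 0.3, 0.4]
f_hat = [0.0, 0.5, 0.6, 0.6, 0.9]
print(absorbed_energy(strain, f_hat), peak_force(f_hat))
print(is_feasible(strain, f_hat, energy_min=0.2, peak_max=1.0))
```

Under a rule of this kind, tightening `peak_max` alone can flip a design from feasible to infeasible, so small errors in a predicted curve near a threshold may matter more than its average error.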

The benchmark features several innovations:

Material-Conditioned Candidate Pools: Each pool is constructed around the feasibility boundaries induced by the physical constraints and material card properties, ensuring that both feasible and infeasible design regimes are well-sampled and that decision metrics are meaningfully stress-tested.
Dimensionless Input Representation: By applying Buckingham- $\pi$ scaling laws, the models operate on nondimensional inputs, with material descriptors as continuous conditioning vectors, supporting zero-shot cross-material transfer evaluation.
Heterogeneous Metrics: Evaluation includes not just nRMSE for curve fitting but also the physical violation rate, top- $k$ feasible retrieval (P@ $k$ ), and regret-at- $k$ (mass penalty over the lightest oracle-feasible design), explicitly dissecting model performance as both a predictor and a decision system.

Figure 2: One oracle, five design regimes: material cards produce widely varying feasible fractions and mass distributions, meaning that decision ceilings imposed by feasibility are material-dependent.
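
The decision metrics listed above can be sketched in a few lines of plain Python. The definitions here are one plausible reading, not the paper's exact protocol: the surrogate is assumed to shortlist the `k` lightest designs it predicts to be feasible, the violation rate is taken over all surrogate-endorsed designs, and regret is reported as relative excess mass. The `Candidate` type and all function names are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    mass: float            # design mass, assumed known from geometry
    oracle_feasible: bool  # ground-truth feasibility from the oracle
    pred_feasible: bool    # feasibility as predicted by the surrogate


def select_top_k(pool: List[Candidate], k: int) -> List[Candidate]:
    """The k lightest designs that the surrogate endorses as feasible."""
    endorsed = sorted((c for c in pool if c.pred_feasible), key=lambda c: c.mass)
    return endorsed[:k]


def precision_at_k(pool: List[Candidate], k: int) -> float:
    """Fraction of the k shortlist slots filled by oracle-feasible designs."""
    return sum(c.oracle_feasible for c in select_top_k(pool, k)) / k


def violation_rate(pool: List[Candidate]) -> float:
    """Fraction of surrogate-endorsed designs that the oracle rejects."""
    endorsed = [c for c in pool if c.pred_feasible]
    if not endorsed:
        return 0.0
    return sum(not c.oracle_feasible for c in endorsed) / len(endorsed)


def regret_at_k(pool: List[Candidate], k: int) -> float:
    """Relative excess mass of the best usable shortlisted design."""
    feasible = [c.mass for c in pool if c.oracle_feasible]
    if not feasible:
        raise ValueError("pool contains no oracle-feasible design")
    best = min(feasible)
    usable = [c.mass for c in select_top_k(pool, k) if c.oracle_feasible]
    if not usable:
        return math.inf  # nothing in the shortlist can actually be built
    return (min(usable) - best) / best


# Synthetic pool, for illustration only.
pool = [
    Candidate(1.0, oracle_feasible=False, pred_feasible=True),
    Candidate(1.1, oracle_feasible=True, pred_feasible=False),
    Candidate(1.2, oracle_feasible=True, pred_feasible=True),
    Candidate(1.5, oracle_feasible=True, pred_feasible=True),
    Candidate(2.0, oracle_feasible=False, pred_feasible=False),
]
print(precision_at_k(pool, 2), violation_rate(pool), regret_at_k(pool, 2))
```

In this toy pool the lightest endorsed design is infeasible and the lightest feasible design is never endorsed, so the three metrics each capture a different failure that a curve-error score would not reveal.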

Dissociation of Curve Fidelity and Decision Utility

A central empirical finding is the frequent dissociation between traditional curve-error metrics (e.g., nRMSE) and actual design utility. For instance, within a single material regime, the model with the lowest curve nRMSE may perform suboptimally on top- $k$ feasible selection, and can even exhibit elevated oracle violation rates—demonstrating that regression accuracy is not a robust proxy for design-rank quality.

Figure 3: Within a material card, models with superior curve fidelity can underperform in feasible design retrieval and constraint safety, highlighting the inadequacy of single-metric evaluation.

This mismatch is most acute for stiffer materials (e.g., PA-CF), where models tuned purely for curve fit can endorse physically inadmissible, high-risk designs, whereas physics-informed models with slightly worse fit yield far better safety and selection performance. Accordingly, the authors argue that any PIML evaluation for design must report all metric families jointly.

Effect of Physics-Informed Losses

Investigation of loss function trade-offs reveals that physics-based objective terms (e.g., energy consistency, peak-force penalty, monotonicity, smoothness) do not provide monotonic improvements across all metric dimensions. Instead, they act as design-space priors that must be explicitly balanced against data fidelity. For example, a strong energy-integral term reduces nRMSE and violation rate but can reduce feasible retrieval when overemphasized; composite loss terms shift balance points, trading curve fit for admissibility and selection utility.
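
The weighting problem described here can be illustrated with a composite loss over a sampled curve. This is a sketch under stated assumptions, not the paper's loss: the term definitions, the default weights, and `peak_limit` are hypothetical, and the monotonicity term is assumed to penalize decreases in cumulative absorbed energy because the quantity constrained in the paper is not specified here. A real training loop would express the same terms in an autodiff framework.

```python
from typing import Dict, Sequence


def _trapz(x: Sequence[float], y: Sequence[float]) -> float:
    """Trapezoidal integral of y over x."""
    return sum(0.5 * (y[i] + y[i + 1]) * (x[i + 1] - x[i]) for i in range(len(x) - 1))


def composite_loss(
    strain: Sequence[float],
    f_pred: Sequence[float],
    f_true: Sequence[float],
    peak_limit: float,
    w_energy: float = 1.0,
    w_peak: float = 1.0,
    w_mono: float = 1.0,
    w_smooth: float = 0.1,
) -> Dict[str, float]:
    n = len(strain)
    if not (len(f_pred) == len(f_true) == n) or n < 3:
        raise ValueError("need three or more aligned samples")
    # Data fidelity: mean squared curve error.
    data = sum((p - t) ** 2 for p, t in zip(f_pred, f_true)) / n
    # Energy consistency: squared mismatch of the integrated curves.
    energy = (_trapz(strain, f_pred) - _trapz(strain, f_true)) ** 2
    # Peak-force penalty: only active when the predicted peak exceeds the limit.
    peak = max(0.0, max(f_pred) - peak_limit) ** 2
    # Monotonicity (assumed form): cumulative absorbed energy must not decrease.
    steps = [
        0.5 * (f_pred[i] + f_pred[i + 1]) * (strain[i + 1] - strain[i])
        for i in range(n - 1)
    ]
    mono = sum(min(0.0, s) ** 2 for s in steps)
    # Smoothness: mean squared second difference of the predicted curve.
    curv = [f_pred[i - 1] - 2.0 * f_pred[i] + f_pred[i + 1] for i in range(1, n - 1)]
    smooth = sum(c ** 2 for c in curv) / len(curv)
    total = data + w_energy * energy + w_peak * peak + w_mono * mono + w_smooth * smooth
    return {"data": data, "energy": energy, "peak": peak,
            "mono": mono, "smooth": smooth, "total": total}


# Synthetic curves, for illustration only.
terms = composite_loss(
    strain=[0.0, 0.1, 0.2, 0.3],
    f_pred=[0.0, 0.5, 0.7, 0.6],
    f_true=[0.0, 0.5, 0.5, 0.6],
    peak_limit=0.65,
)
print(terms)
```

Returning the individual terms alongside the total makes per-term ablation straightforward: setting one weight to zero and re-evaluating the decision metrics shows which trade-off that term is responsible for.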

This finding directly challenges the notion that "more physics is strictly better" for surrogate modeling in design-driven settings, implying that the scalarization of physics-based losses should be task- and metric-sensitive.

Pooled Material-Conditioned Surrogates and Cross-Material Transfer

Pooled, dimensionless surrogates—conditioned via continuous material vectors—are shown to fit multiple materials simultaneously, achieving moderate overall nRMSE. However, decision metrics (feasibility retrieval, regret, violation rates) remain strongly material-specific, and a pooled average statistic risks obscuring critical per-material failures. This result motivates material-by-material and use-case-driven reporting in surrogate evaluation.

Figure 4: Pooled, material-conditioned surrogates can learn multi-material responses, but decision metrics such as oracle violations and feasible top- $k$ selections remain only partially correlated across material cards.
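
The risk of pooled averages can be shown with a small reporting helper. The card names and scores below are placeholders, not values from the paper, and the acceptance floor of 0.5 is a hypothetical threshold.

```python
from statistics import mean
from typing import Dict, List, Tuple


def summarize(per_material: Dict[str, float], floor: float) -> Tuple[float, str, List[str]]:
    """Return the pooled mean, the worst material, and all materials below floor."""
    pooled = mean(per_material.values())
    worst = min(per_material, key=per_material.get)
    failing = sorted(m for m, score in per_material.items() if score < floor)
    return pooled, worst, failing


# Placeholder P@k scores per material card (synthetic).
p_at_k = {"card_1": 0.9, "card_2": 0.9, "card_3": 0.8, "card_4": 0.9, "card_5": 0.2}
pooled, worst, failing = summarize(p_at_k, floor=0.5)
print(pooled, worst, failing)
```

The pooled mean looks acceptable even though one card falls far below the floor, which is why a per-material breakdown is reported alongside any aggregate.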

The cross-material "transfer matrix" shows that dimensionless scaling only partially neutralizes regime bias. Transfer is strongly direction-dependent: compliant-to-stiff transfers can succeed, whereas models trained on stiffer materials often fail to select feasible candidates for more compliant ones or incur unacceptably high error. Dimensionless formulations thus align numerical ranges, but decision utility and constraint adherence do not transfer uniformly.

Figure 5: Cross-material transfer remains strongly asymmetric despite dimensionless formulation; error and retrieval rates differ by orders of magnitude between different transfer directions.
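
One simple way to quantify this directionality is to compare each transfer direction with its reverse. The sketch below assumes a nested mapping `matrix[source][target]` holding any single metric; the card labels and scores are synthetic placeholders, and the signed-gap definition is an illustrative choice rather than the paper's analysis.

```python
from typing import Dict, Tuple


def transfer_asymmetry(
    matrix: Dict[str, Dict[str, float]]
) -> Dict[Tuple[str, str], float]:
    """Signed gap matrix[a][b] - matrix[b][a] for every unordered pair of cards."""
    names = sorted(matrix)
    gaps = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            gaps[(a, b)] = matrix[a][b] - matrix[b][a]
    return gaps


# Rows: training card. Columns: evaluation card. Synthetic P@k values.
p_at_k = {
    "compliant": {"compliant": 0.9, "stiff": 0.7},
    "stiff": {"compliant": 0.2, "stiff": 0.8},
}
for pair, gap in transfer_asymmetry(p_at_k).items():
    print(pair, gap)
```

A gap near zero indicates symmetric transfer for that pair, while a large positive or negative value marks the direction in which the surrogate generalizes better.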

Implications and Theoretical Insights

This work establishes that surrogate evaluation in engineering design contexts cannot rely on interpolation accuracy alone. Decision-relevant metrics such as physically admissible top- $k$ retrieval and regret-at- $k$ are indispensable, because physical constraints and candidate-pool geometry impose a sharp split between feasible and infeasible sets. The benchmark demonstrates that physics-informed losses shape the utility landscape by moving decision boundaries, not merely by improving regression performance.

For future research, this analysis suggests:

Surrogate architectures (including neural operators, GNNs, or meta-learners) must be evaluated using decision-aware, heterogeneous metrics in addition to curve fit.
Loss function design should be explicitly task-targeted, with ablation of individual physics terms, to prevent harmful trade-offs masked by aggregate loss minimization.
Cross-regime and transfer experiments are critical diagnostics for surrogate generalization, and pooled reporting must be complemented with target-specific breakdowns.
Benchmarks like pinn-gym serve as a reproducible, controlled testbed for meaningful method comparison under declaratively defined constraints.

Conclusion

The pinn-gym benchmark (2606.07146) recenters PIML surrogate assessment on decision-aware criteria. Within its released oracle, candidate generator, and material cards, it provides a transparent, reproducible baseline for quantifying the utility of machine-learned surrogates in material-conditioned design and optimization tasks; it is not a certified material model. The empirical findings (the decoupling of regression fit from decision quality, the nuanced role of physics-informed losses, and asymmetric cross-material transfer) indicate that progress in scientific ML for engineering depends on evaluation frameworks that mirror end-use demands. The work thus motivates further development of benchmarks, loss functions, and architectures under decision-centric scrutiny.
