Benchmarks to quantify cost–accuracy frontiers for surrogate training

Develop open benchmarks that pair instrumented corpora with surrogate models to quantify when training surrogates amortizes the generation cost of instrumented data and to characterize the resulting cost–accuracy trade-offs.

Background

Generating instrumented data for complex simulations can be expensive, and training surrogates on that data is a common way to amortize the cost. However, the break-even points and the cost–accuracy trade-offs have not been quantified systematically across domains.
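To make the notion of a break-even point concrete, here is a minimal sketch under a deliberately simple cost model that is an assumption of this example, not something taken from the cited paper: a one-off cost to generate the instrumented corpus, a one-off training cost, and a constant per-query cost for the simulator and for the surrogate. The function name and all parameter names are hypothetical. Under this model the surrogate pays for itself once the number of queries N satisfies N * (sim_cost_per_query - surrogate_cost_per_query) >= corpus_cost + training_cost.

```python
import math
from typing import Optional


def break_even_queries(
    corpus_cost: float,
    training_cost: float,
    sim_cost_per_query: float,
    surrogate_cost_per_query: float,
) -> Optional[int]:
    """Smallest number of queries at which the surrogate amortizes its up-front cost.

    Assumed (hypothetical) cost model: fixed up-front costs plus constant
    per-query costs, all in the same unit (e.g. CPU-hours).
    Returns None when the surrogate is not cheaper per query than the
    simulator, because the up-front cost is then never recovered.
    """
    saving_per_query = sim_cost_per_query - surrogate_cost_per_query
    if saving_per_query <= 0:
        return None
    upfront_cost = corpus_cost + training_cost
    return math.ceil(upfront_cost / saving_per_query)


# Illustrative numbers only, not measurements.
n_star = break_even_queries(
    corpus_cost=1000.0,
    training_cost=200.0,
    sim_cost_per_query=10.0,
    surrogate_cost_per_query=0.5,
)
print(n_star)
```

This sketch ignores accuracy entirely: a surrogate that breaks even on cost but misses the required error tolerance has not amortized anything useful, which is why cost and accuracy need to be measured together.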

Open benchmarks would enable consistent measurement of when and how surrogates achieve acceptable accuracy at reduced cost, guiding corpus design and deployment decisions.
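One way a benchmark could report such measurements is as a cost–accuracy frontier: the set of surrogate configurations for which no other configuration is both cheaper and at least as accurate. The sketch below is one possible reading, assuming each benchmark run is summarized as a hypothetical (total cost, test error) pair with lower values better on both axes; the function name and the data are illustrative, not results.

```python
from typing import List, Tuple

Run = Tuple[float, float]  # (total cost, test error); both are "lower is better"


def cost_accuracy_frontier(runs: List[Run]) -> List[Run]:
    """Return the non-dominated runs, ordered by increasing cost.

    A run is kept only if it has strictly lower error than every run that
    costs the same or less, so ties in error go to the cheaper run.
    """
    frontier: List[Run] = []
    best_error = float("inf")
    # Sorting tuples orders by cost first, then by error within equal cost.
    for cost, error in sorted(runs):
        if error < best_error:
            frontier.append((cost, error))
            best_error = error
    return frontier


# Illustrative (cost, error) pairs only, not measurements.
runs = [
    (1.0, 0.30),
    (2.0, 0.35),   # dominated: costs more and is less accurate than (1.0, 0.30)
    (4.0, 0.10),
    (4.0, 0.20),   # dominated by (4.0, 0.10)
    (8.0, 0.10),   # dominated: same error as (4.0, 0.10) at higher cost
    (16.0, 0.05),
]
print(cost_accuracy_frontier(runs))
```

Reporting the whole frontier rather than a single best model lets a reader pick the operating point that matches their own accuracy tolerance and budget.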

References

Nine open questions will determine whether instrumented data matures into a recognised substrate for scientific machine learning. Cost–accuracy frontiers. Open benchmarks pairing instrumented corpora with surrogates are needed to quantify when surrogate training amortises the substrate's cost.

Instrumented data for causal scientific machine learning (2606.07865 - Wilke, 5 Jun 2026) in Section 7, Methodological questions for the community, Item 5