
SIMPLER Benchmark: Unified Evaluation of Simulation Methods

Updated 29 January 2026
  • SIMPLER Benchmark is a unified framework that standardizes evaluation protocols for both classical integrators and data-driven models on diverse physical simulation tasks.
  • It integrates canonical dynamical systems with reference solvers and baseline models to assess accuracy, stability, and computational efficiency using consistent metrics.
  • The framework promotes reproducibility and extensibility through modular system definitions, rigorous evaluation procedures, and open-source tools for seamless experimental configuration.

The SIMPLER benchmark (Systematic, Interoperable, Modular, Physics-agnostic, Level-scalable, Extensible, and Reproducible; Editor's term) refers to a unified framework introduced by Otness et al. for evaluating classical and data-driven methods for simulating physical systems, particularly in the context of scientific machine learning. Motivated by the proliferation of heterogeneous benchmarks, disparate protocols, and outcome metrics in the burgeoning field of data-driven simulation, the SIMPLER paradigm consolidates a small but representative set of canonical dynamical systems, standardizes initial condition generation, supplies reference solvers and baselines, and establishes rigorous evaluation protocols for model accuracy, stability, and computational efficiency (Otness et al., 2021).

1. Motivation and Objectives

Traditional numerical solvers for physical systems—such as finite difference or finite element schemes augmented by matrix-based or iterative time integration—remain robust and general, but are often computationally demanding and may be inadequate where closed-form constitutive models are missing. The emergence of machine learning-based regression models (MLPs, CNNs, kernel machines, graph nets) has driven a need for systematic benchmarking, as prior studies tended to select bespoke testbeds, datasets, and metrics, limiting fair comparison and reproducibility.

The central objectives of the SIMPLER framework are:

  • To curate a small but diverse set of physical simulation tasks spanning low-dimensional ODEs to high-dimensional, nonlinear PDEs.
  • To provide reference implementations for both standard time integrators and baseline data-driven methods.
  • To establish a unified, extensible protocol for evaluating new methods, including metrics for accuracy, stability, and runtime, thereby lowering barriers for reproducible benchmarking and method comparison.

2. Canonical Physical Systems

Each benchmark system is cast, after spatial discretization, as a first-order ODE: $\dot x(t) = f\bigl(x(t)\bigr), \quad x \in \mathbb{R}^N$

Benchmark systems:

| System | Dynamics | Features |
| --- | --- | --- |
| 1D linear spring | $\dot q = p,\quad \dot p = -q$ | 2D phase space; ODE |
| 1D wave equation | $\partial_{tt} u = c^2\,\partial_{xx} u,\ x \in [0,1)$; periodic BCs | 125 grid points; PDE |
| 2D mass–spring mesh | $\dot q_a = p_a,\quad \dot p_a = \sum_{b \in \mathcal{N}(a)} \bigl[-k(\|q_a - q_b\| - \ell_{ab})\,(q_a - q_b)/\|q_a - q_b\| - \gamma\,\dot q_a\bigr]$ | $10 \times 10$ particle mesh; moderate $N$; nonlinear ODE |
| 2D incompressible Navier–Stokes | $\rho\,\partial_t u + \rho\,(u \cdot \nabla)u - \nu\,\Delta u + \nabla p = b,\quad \nabla \cdot u = 0$ | unstructured FEM; large $N$; PDE |

Each system provides a range of dynamical features (linearity, dimensionality, Hamiltonian/energy structure, dissipation, boundary conditions) representative of common challenges in physical simulation. Initial conditions are sampled using standardized protocols—including phase-space circles (spring), pulses via cubic-spline kernels (wave), and random disk displacements or obstacle placements (mesh, NS).
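As an illustration, the simplest of these systems can be written against the first-order form $\dot x = f(x)$; the function names and the radius range below are illustrative, not the benchmark's actual API:

```python
import numpy as np

def spring_rhs(x):
    """Right-hand side f(x) for the 1D linear spring, with state x = (q, p)."""
    q, p = x
    return np.array([p, -q])

def sample_spring_ic(rng, radius_range=(0.2, 1.0)):
    """Sample an initial condition on a circle in (q, p) phase space,
    mirroring the standardized phase-space-circle protocol for the spring."""
    r = rng.uniform(*radius_range)        # circle radius (sets the energy level)
    theta = rng.uniform(0.0, 2 * np.pi)   # phase angle on the circle
    return np.array([r * np.cos(theta), r * np.sin(theta)])
```

Because the state lies on a circle of radius $r$, every sampled trajectory conserves $q^2 + p^2 = r^2$ under the exact dynamics, which makes energy drift a convenient diagnostic for this system.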

3. Baseline Methods: Numerical Integrators and Data-Driven Models

The SIMPLER suite encompasses both classical time integrators and physics-agnostic regression baselines.

Classical Integrators:

  • Forward Euler: $x_{k+1} = x_k + \Delta t\, f(x_k)$; explicit, first order, conditionally stable.
  • Leapfrog: symplectic, second order; specialized for Hamiltonian systems.
  • Runge–Kutta 4 (RK4): fourth order, explicit; balances accuracy and step size.
  • Backward Euler: $x_{k+1} - \Delta t\, f(x_{k+1}) = x_k$; implicit, first order, stable for stiff systems.
  • BDF2: second-order, implicit multistep; suited to diffusion-dominated/stiff problems.
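The explicit integrators can be sketched in a few lines; these are generic textbook implementations of the named schemes, not the benchmark's reference code:

```python
import numpy as np

def euler_step(f, x, dt):
    """Forward Euler: explicit, first order."""
    return x + dt * f(x)

def rk4_step(f, x, dt):
    """Classical fourth-order Runge-Kutta: four stage evaluations per step."""
    k1 = f(x)
    k2 = f(x + 0.5 * dt * k1)
    k3 = f(x + 0.5 * dt * k2)
    k4 = f(x + dt * k3)
    return x + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

def leapfrog_step(q, p, accel, dt):
    """Symplectic leapfrog (kick-drift-kick) for separable Hamiltonian systems."""
    p_half = p + 0.5 * dt * accel(q)
    q_new = q + dt * p_half
    p_new = p_half + 0.5 * dt * accel(q_new)
    return q_new, p_new
```

On the linear spring, a single RK4 step at $\Delta t = 0.1$ is already several orders of magnitude more accurate than a forward Euler step, which is the property the speed-up-at-matched-accuracy metric below exploits.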

Data-Driven Baselines:

  • Nearest neighbor (KNN): $k = 1$ regression; serves as a memorization baseline.
  • Random-feature kernel ridge regression ("nn-kernel"): ReLU random features ($L = 32{,}768$) with ridge regression.
  • MLPs: depth 2–5, width up to 4096; tanh activations, Adam optimizer.
  • CNNs: 5 layers, $9 \times 9$ kernels, up to 64 channels, no pooling.
  • U-Net (Navier–Stokes only): encoder–decoder with skip connections, upsampling to $256 \times 256$ at full depth.

Training is performed on pairs $(x_k, \dot x_k)$ (derivative learning) or $(x_k, x_{k+1})$ (step learning) with a mean-squared error (MSE) objective, batch sizes 32–64, and 250–800 epochs. For Navier–Stokes, small Gaussian noise is injected to promote rollout stability.
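A minimal sketch of constructing the two training-pair variants from a trajectory array follows; in the benchmark itself derivative targets come from the system definition, so the forward-difference estimate here is a stand-in assumption, as is the placement of the noise on the inputs:

```python
import numpy as np

def make_training_pairs(traj, dt, mode="deriv", noise_std=0.0, rng=None):
    """Build supervised pairs from a trajectory array of shape (T, N).

    mode="deriv": targets approximate x_dot (derivative learning)
    mode="step":  targets are the next snapshot x_{k+1} (step learning)
    """
    rng = rng or np.random.default_rng()
    inputs = traj[:-1]
    if mode == "deriv":
        targets = (traj[1:] - traj[:-1]) / dt   # forward-difference derivative estimate
    else:
        targets = traj[1:]
    if noise_std > 0:                            # Gaussian noise injection (NS-style)
        inputs = inputs + rng.normal(0.0, noise_std, inputs.shape)
    return inputs, targets
```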

4. Evaluation Protocols and Metrics

Evaluation rigorously compares methods by solution error, stability, and efficiency:

  • Per-step MSE for a trajectory $i$: $\mathrm{MSE}(k) = \frac{1}{N} \sum_{j=1}^{N} \bigl\|\hat x_j^{(i)}(t_k) - x_j^{(i)}(t_k)\bigr\|^2$
  • Per-trajectory error: per-step MSE averaged over the trajectory's steps.
  • Aggregate statistics: distribution of per-trajectory MSE over the held-out evaluation set (box plots).
  • Stability: boundedness over long rollouts; detection of blowup in learned predictors.
  • Computational efficiency: per-step runtime (CPU/GPU), and the speed-up factor relative to classical integration at matched accuracy (i.e., how large a $\Delta t$ an integrator can use before its error equals the learned model's).

No further data preprocessing is performed except noise injection on the NS system. Training, evaluation, and out-of-distribution assessment (by shifting initial condition ranges) are standardized.
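The first two metrics, plus a crude blowup check, can be sketched as follows; the stability bound is an arbitrary illustrative threshold, not a value from the benchmark:

```python
import numpy as np

def per_step_mse(pred, truth):
    """Per-step MSE over state components; pred and truth have shape (T, N)."""
    return np.mean((pred - truth) ** 2, axis=1)

def per_trajectory_mse(pred, truth):
    """Per-step MSE averaged over the rollout."""
    return per_step_mse(pred, truth).mean()

def is_stable(pred, bound=1e6):
    """Crude blowup detector for long rollouts: finite and bounded."""
    return bool(np.all(np.isfinite(pred)) and np.max(np.abs(pred)) < bound)
```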

5. Experimental Design and API Extensibility

The framework includes multi-scale training set sizes (e.g., spring: 10/500/1000 trajectories; NS: 25/50/100), with each larger set strictly containing the previous. Dataset generation uses fine integrator steps with subsequent subsampling to define "ground truth" snapshots.
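The generation scheme can be sketched as fine-step integration followed by subsampling; the RK4 stepper stands in for whichever reference integrator a system actually uses, and the function names are illustrative:

```python
import numpy as np

def generate_dataset(rhs, ics, dt_fine, n_fine, subsample):
    """Integrate each initial condition at a fine step, then subsample the
    fine trajectory to define the 'ground truth' snapshots."""
    def rk4(x, dt):
        k1 = rhs(x)
        k2 = rhs(x + 0.5 * dt * k1)
        k3 = rhs(x + 0.5 * dt * k2)
        k4 = rhs(x + dt * k3)
        return x + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

    trajs = []
    for x0 in ics:
        xs = [np.asarray(x0, dtype=float)]
        for _ in range(n_fine):
            xs.append(rk4(xs[-1], dt_fine))
        trajs.append(np.stack(xs)[::subsample])  # keep every subsample-th snapshot
    return np.stack(trajs)
```

The strictly nested training sets then fall out of slicing one generated array, e.g. `data[:10]`, `data[:500]`, and `data[:1000]` for the spring.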

Modularity and configuration:

  • Each system is defined as a Python/C++ module supporting initial-condition sampling, integrator rollout, and storage of metadata and arrays (`.npz`).
  • Experiments are orchestrated via JSON "run description" files, managed by scripts (manage_runs.py) capable of launching, monitoring, and logging jobs, including interfacing with batch systems.
  • Configuration includes system, parameters (e.g., grid size, viscosity), integrator, learning task (derivative vs. step), architecture, set size, and random seed.
  • Full reproducibility is enabled by tracking dependencies (via environment.yml) and providing an optional Singularity container recipe.
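A hypothetical run description, serialized to JSON, sketches the kind of configuration listed above; the field names are illustrative and do not reproduce the benchmark's exact schema:

```python
import json

# Hypothetical run-description fields for illustration only.
run = {
    "system": "spring",
    "system_args": {"num_trajectories": 500},
    "integrator": "rk4",
    "task": "deriv",                      # derivative vs. step learning
    "architecture": {"type": "mlp", "depth": 3, "width": 2048},
    "training": {"batch_size": 32, "epochs": 400, "seed": 0},
}

# The kind of file a manage_runs.py-style launcher would consume.
text = json.dumps(run, indent=2)
```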

Adding a new physical system or model requires implementing new modules/classes and integrating them with the code generation and evaluation pipeline.

6. Empirical Results and Main Findings

Key findings are:

  • Precision of numerical integrators: high-order (or symplectic) integrators at moderate $\Delta t$ remain strictly more accurate than any learned model across all canonical systems.
  • KNN in data-scarce regimes: in systems with low-dimensional initial-condition spaces (e.g., NS around a single obstacle), KNN memorization is competitive and in some scenarios outperforms small MLPs and CNNs.
  • Model saturation: neural networks (MLP, CNN) plateau rapidly in error despite increasing data, while KNN and kernel methods continue to improve, suggesting expressiveness or optimization limitations of standard architectures on certain scientific data distributions.
  • Derivative vs. step learning: derivative learning (predicting $f(x)$) and step learning (predicting $x_{k+1}$) yield distinct numerical behaviors; step learners can absorb integration error but are susceptible to artifacts, especially under global coordinate representations (e.g., CNN on the mesh in step mode).
  • Computational overhead: learned models are 10–100$\times$ slower per step than explicit integrators; KNN incurs additional runtime due to $O(N)$ neighbor search. Most neural baselines require a small $\Delta t$ to reach the accuracy of even coarse-time-step traditional solvers.
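The memorization behavior of the KNN baseline, and its $O(N)$ per-query cost, are easy to see in a direct implementation sketch:

```python
import numpy as np

def knn_predict(train_x, train_y, query):
    """k=1 nearest-neighbor regression: return the training target whose
    input is closest to the query (an O(N) linear scan per query)."""
    dists = np.linalg.norm(train_x - query, axis=1)  # distance to every stored input
    return train_y[np.argmin(dists)]
```

Prediction quality therefore depends entirely on how densely the training set covers the initial-condition space, which is why this baseline shines precisely in the low-dimensional NS setting described above.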

Sample MSE statistics (in-distribution, smallest vs. largest training set):

| System | Representative MSE (smallest → largest training set) |
| --- | --- |
| Spring | Euler: $10^{-5}$; KNN: $10^{-3} \to 10^{-4}$; MLP: $10^{-2}$ |
| Wave | KNN: $10^{-3} \to 10^{-4}$; MLP: $10^{-2}$; RK4 ($\Delta t \times 8$): $10^{-2}$ |
| Spring–mesh | CNN: $10^{-1}$; KNN: $10^{1}$; Euler (subsample 8): $10^{-2}$ |
| Navier–Stokes | U-Net: $10^{-1}$; KNN: $10^{0}$; BDF3: $10^{-2}$ |

A plausible implication is that improved model architectures, increased dataset diversity, or physics-aware inductive biases would be needed to close the gap with carefully tuned numerical integrators in the general case.

7. Availability and Research Impact

All code, datasets, and experiment scripts are public under permissive open-source licenses in the NYU Faculty Digital Archive and at https://github.com/karlotness/nn-benchmark. The modular, systematic structure supports rapid extension across physical domains or ML architectures.

The SIMPLER/nn-benchmark methodology provides a rigorous, extensible reference platform for the comparative study of simulation algorithms—enabling the community to probe learning vs. classical trade-offs, identify failure modes, and assess true computational advantages across a unified set of tasks and metrics (Otness et al., 2021).
