
PDEGym Benchmark Overview

Updated 5 September 2025
  • PDEGym Benchmark is a standardized, extensible suite that quantitatively evaluates ML surrogates on diverse PDE simulation tasks.
  • It uses high-fidelity datasets, modular APIs, and rigorous performance metrics to ensure reproducibility and physical consistency.
  • The benchmark enables head-to-head comparisons between classical solvers and state-of-the-art ML models, driving advances in scientific machine learning.

A PDEGym Benchmark is a standardized, extensible, and challenging suite designed for the quantitative evaluation of ML models on a wide range of partial differential equation (PDE) simulation tasks. It encompasses high-fidelity datasets, modular code, and rigorous performance metrics to facilitate head-to-head comparisons between classical numerical solvers and state-of-the-art ML surrogates. Such benchmarks play a foundational role in advancing scientific machine learning by providing controlled, reproducible problem settings and holistic metrics that capture both data accuracy and physical consistency.

1. Purpose and Design Principles

PDEGym Benchmarks are constructed to address deficiencies in prior benchmarks: limited diversity of PDE types, insufficient dataset scale, poor extensibility, and lack of physically meaningful metrics. These frameworks provide:

  • Broad coverage of time-dependent and time-independent PDEs, including systems exhibiting shock waves, turbulence, and nonlinear behavior.
  • Extensive, ready-to-use datasets, typically covering multiple regimes of initial conditions, boundary conditions, and physical parameters.
  • Modular, user-friendly APIs (e.g., PyTorch, JAX) that allow for streamlined data loading, model training, and evaluation.
  • Baseline ML model implementations (such as U-Net, Fourier Neural Operator, PINN) for comparative analysis and reproducibility.
  • Extensible infrastructure, facilitating the community’s addition of new problems, parameter settings, and baseline methods.

This approach ensures that benchmarking is both rigorous and adaptable to emerging scientific ML techniques.

2. Scope of PDE Problems and Dataset Construction

A PDEGym Benchmark spans a wide range of equations and physical scenarios:

Dimension | Example PDE Problems | Physical Regimes
1D | Advection, Burgers’, Diffusion-Reaction | Nonlinear transport, diffusion
2D | Diffusion-Reaction, Darcy Flow, Shallow Water | Pattern formation, heterogeneous media
3D | Compressible Navier–Stokes, Turbulence | Shock, turbulence, boundary-layer phenomena

Datasets are systematically generated using highly accurate numerical solvers and stored in standardized formats (e.g., HDF5), with typical sizes up to 10,000 samples per task and resolutions such as 1,024 spatial cells in 1D or 512×512 grids in 2D. Each dataset varies:

  • Initial conditions (sinusoidal, random noise)
  • Boundary conditions (periodic, Dirichlet, Neumann, out-going)
  • Physical parameters (diffusion coefficient, viscosity, advection speed, etc.)

These datasets enable interpolation and extrapolation studies for ML surrogates, including forward prediction and inverse inference tasks (e.g., recovering initial states from terminal data).
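
As a concrete illustration of how such archived trajectories can feed a forward-prediction pipeline, the sketch below wraps a single HDF5 file in a PyTorch Dataset. The file name, the "tensor" key, and the assumed array layout (samples × timesteps × spatial cells) are illustrative placeholders, not the benchmark's actual schema.

import h5py
import torch
from torch.utils.data import Dataset

class PDETrajectoryDataset(Dataset):
    """Minimal sketch: HDF5-backed trajectories for forward prediction.

    Assumes the file stores one array of shape (n_samples, n_timesteps,
    n_cells); the file name and key below are hypothetical.
    """

    def __init__(self, path="data/1D_Advection.hdf5", key="tensor", input_steps=10):
        with h5py.File(path, "r") as f:
            self.data = torch.from_numpy(f[key][:]).float()
        self.input_steps = input_steps

    def __len__(self):
        return self.data.shape[0]

    def __getitem__(self, idx):
        traj = self.data[idx]
        # Early snapshots are the model input; the remainder is the target.
        return traj[: self.input_steps], traj[self.input_steps:]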

3. Software API and Extensibility

Benchmarks leverage modular, extensible APIs for data management and model integration:

  • Standardized data access routines (e.g., via pyDaRUS, a PyTorch Dataset, or a JAX data loader) provide uniform access to the underlying simulation data. For instance:

from pyDaRUS import Dataset

# Download the archived dataset (persistent DOI on DaRUS) into a local directory.
pid = "doi:10.18419/darus-2986"
dataset = Dataset.from_dataverse_doi(pid, filedir="data/")

  • Configurable pipelines (integrated with tools like Hydra) let users adjust solver parameters or grid resolution, or add new PDE problems by extending the source modules (see the configuration sketch after this list).
  • Baseline training scripts and precomputed results facilitate reproducible comparisons.
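
A minimal sketch of the Hydra-based configuration pattern is shown below, assuming a configs/train.yaml file exists; the config group names (pde, data, model) are hypothetical and only illustrate how a problem, resolution, or baseline could be swapped from the command line.

import hydra
from omegaconf import DictConfig, OmegaConf

# Hypothetical schema: a real benchmark defines its own config groups.
@hydra.main(version_base=None, config_path="configs", config_name="train")
def main(cfg: DictConfig) -> None:
    # Fields such as cfg.pde.name, cfg.data.resolution, or cfg.model.type could
    # select the PDE problem, grid resolution, and baseline architecture, e.g.:
    #   python train.py pde.name=burgers data.resolution=1024 model.type=fno
    print(OmegaConf.to_yaml(cfg))

if __name__ == "__main__":
    main()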

Datasets are archived following FAIR principles and assigned permanent DOIs, promoting accessibility and interoperability.

4. Evaluation Metrics and Physical Consistency

PDEGym Benchmarks introduce a suite of metrics designed for holistic assessment:

Metric | Formula / Definition | Physical Interpretation
RMSE | \lVert u_{\mathrm{pred}} - u_{\mathrm{true}} \rVert_2 | L2 error over the domain
nRMSE | \lVert u_{\mathrm{pred}} - u_{\mathrm{true}} \rVert_2 / \lVert u_{\mathrm{true}} \rVert_2 | Normalized global error
cRMSE | \lVert \sum u_{\mathrm{pred}} - \sum u_{\mathrm{true}} \rVert_2 / N | Conserved-quantity (conservation) error
bRMSE | Error evaluated solely on boundary nodes | Boundary condition fit
fRMSE | RMSE computed bandwise in Fourier space | Scale-resolved (low/mid/high frequency) error

For inverse problems, errors on initial-condition recovery (u_0) and on the subsequent dynamics are computed. Physics-oriented metrics (such as cRMSE and fRMSE) assess whether ML surrogates capture essential physical invariants and dynamical features beyond mere data fitting.
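
The sketch below illustrates how the tabulated metrics might be computed for batched 1D fields in PyTorch; the averaging conventions and the Fourier band boundaries are assumptions for illustration rather than the benchmark's official definitions.

import torch

def nrmse(u_pred, u_true):
    # Normalized L2 error over the spatial domain, averaged over the batch.
    diff = torch.linalg.vector_norm(u_pred - u_true, dim=-1)
    ref = torch.linalg.vector_norm(u_true, dim=-1)
    return (diff / ref).mean()

def crmse(u_pred, u_true):
    # Error in the conserved quantity (domain sum), averaged over snapshots.
    return (u_pred.sum(dim=-1) - u_true.sum(dim=-1)).abs().mean()

def frmse(u_pred, u_true, band=(0, 4)):
    # RMSE restricted to a band of spatial Fourier modes (low/mid/high frequency).
    lo, hi = band
    err = torch.fft.rfft(u_pred - u_true, dim=-1)[..., lo:hi]
    return err.abs().pow(2).mean().sqrt()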

5. Experimental Challenges and Observed Limitations

Benchmarks reveal several domains where ML surrogates are currently challenged:

  • High-frequency phenomena (e.g., shocks, turbulence): Frequency-domain errors remain large, and effects such as Gibbs artifacts are observed (notably with FNOs).
  • Sensitive inverse problems: Recovery tasks magnify small errors due to nonlinearity and parameter sensitivity.
  • Temporal extrapolation: Autoregressive rollouts suffer from accumulated errors, especially over long horizons (see the rollout sketch after this list).
  • Variable input scales: Certain equations, such as Darcy flow or diffusion-sorption systems, require methods that are robust to large variations in the magnitude and scale of the input fields.
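
To make the error-accumulation mechanism concrete, the following sketch rolls a one-step surrogate forward autoregressively; the interface (a model mapping one snapshot to the next) is an assumption for illustration.

import torch

def autoregressive_rollout(model, u0, n_steps):
    # Feed the surrogate its own predictions: errors made at early steps are
    # re-ingested at later steps, which is why long-horizon rollouts drift.
    states = [u0]
    u = u0
    with torch.no_grad():
        for _ in range(n_steps):
            u = model(u)
            states.append(u)
    return torch.stack(states)  # shape: (n_steps + 1, *u0.shape)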

These observations motivate the development of more robust ML architectures, multi-scale design strategies, and physically constrained loss functions.

6. Comparison with Previous Benchmarks

Relative to prior work, a PDEGym Benchmark advances on several fronts:

  • Broader Coverage: Includes more PDEs (e.g., 11 core types in PDEBench with >35 dataset variants), covering realistic physics and parameter regimes.
  • Usability: Offers open-source codebases, thorough documentation, and integration with mainstream ML frameworks.
  • Metrics Innovation: Embeds physical invariants and conservation properties as central evaluation criteria, rather than relying solely on global regression metrics.
  • Reproducibility: Datasets and results are standardized and archived for persistent accessibility.
  • Experimental Protocols: Facilitates comprehensive ablation and generalization analyses (across initial conditions, resolutions, and temporal windows).

This institutionalizes reproducibility, extensibility, and physically meaningful evaluation, setting a new standard for scientific ML benchmarking.

7. Future Directions and Community Challenges

Ongoing development areas highlighted by PDEGym benchmarks include:

  • Enhanced multi-scale and frequency-adaptive ML architectures to address high-frequency prediction errors.
  • Inverse models capable of robustly inferring hidden states/parameters from limited observational data.
  • Expansion to more heterogeneous and complex PDE systems (e.g., multiphase flows, irregular domains).
  • Embedding physical law enforcement (e.g., conservation, boundary constraints) directly within neural architectures or loss functions.

Addressing these challenges will further close the gap between ML-based surrogates and classical numerical methods, supporting broader application to scientific and engineering simulation tasks.

Conclusion

A PDEGym Benchmark provides a comprehensive, reproducible, and physically rigorous foundation for the evaluation and development of machine learning surrogates in PDE-based simulation tasks. By unifying diverse PDE scenarios, high-quality datasets, modular software components, and holistic metrics, it catalyzes progress in scientific machine learning and enables systematic, head-to-head comparisons across modeling paradigms. The benchmark infrastructure’s extensibility and attention to physical consistency ensure its relevance for both present research frontiers and future innovation trajectories in computational science (Takamoto et al., 2022).

References
Takamoto, M., Praditia, T., Leiteritz, R., MacKinlay, D., Alesiani, F., Pflüger, D., & Niepert, M. (2022). PDEBench: An Extensive Benchmark for Scientific Machine Learning. Advances in Neural Information Processing Systems (NeurIPS 2022), Datasets and Benchmarks Track.