Symbolic Regression Benchmarks

Updated 29 December 2025
  • Symbolic regression benchmarks are structured protocols that rigorously compare algorithms based on metrics like accuracy, complexity, and interpretability.
  • They incorporate diverse datasets—including black-box, first-principles, and dummy-variable tests—to ensure fair, reproducible evaluation across scientific and engineering domains.
  • Modern benchmarks utilize multi-objective criteria, balancing model expressiveness, computational resources, and energy consumption for robust method assessment.

Symbolic regression benchmarks are structured, reproducible experimental protocols and curated dataset-method suites designed to rigorously evaluate and compare the performance of symbolic regression (SR) algorithms. SR benchmarks serve as the primary infrastructure for assessing advances in automated, interpretable model discovery, guiding both method development and application in scientific and engineering domains. Contemporary benchmarks encompass a diverse range of algorithms—spanning GP-based, deterministic, neural, Bayesian, hybrid, and exhaustive-search paradigms—evaluated across taxonomically broad black-box, real-world, and first-principles datasets. The modern benchmarking landscape emphasizes not just accuracy, but explicit trade-offs between expressiveness, model complexity, computational (and energy) cost, and alignment with human interpretability.

1. Evolution of Symbolic Regression Benchmarks

The historical trajectory of SR benchmarking began with small, ad hoc, handpicked suites such as the Koza, Nguyen, and Keijzer functions. The emergence of open-source, large-scale benchmark platforms such as SRBench (Cava et al., 2021), PMLB (the Penn Machine Learning Benchmarks repository), and the SRSD-Feynman suite (Matsubara et al., 2022) marked a shift toward reproducible, extensible, and taxonomically diverse datasets. Early benchmarks focused on accuracy (e.g., mean squared error, R^2), but recent generations have shifted toward multi-objective frameworks that systematically quantify the complexity–accuracy–interpretability–resource trade-space. SRBench 2.0 further delivers a “living benchmark” approach with periodic deprecation, extensibility, unified APIs, and formalized hardware constraints, positioning itself as the community standard for state-of-the-art evaluation (Aldeia et al., 6 May 2025).

2. Dataset Construction and Taxonomy

Modern benchmarks organize datasets along multiple axes to ensure that evaluation is not biased toward any single function class or domain:

  • Black-box track: Datasets from the PMLB repository, including classical regression benchmarks, biomedical, engineering, and synthetic tasks of varying sample and feature counts. Selection uses meta-feature profiling (number of samples, number of features, prior SR performance), dimensionality reduction (t-SNE), and k-means clustering to obtain a representative, non-redundant subset (e.g., 12 out of 122 in SRBench 2.0). It is mandated that ≤25% are Friedman-type to avoid synthetic-data bias (Aldeia et al., 6 May 2025, Aldeia, 1 Dec 2025). A sketch of this selection step is given below.
  • First-principles track: Real-world and physics-inspired problems with known ground-truth formulae (Kepler’s law, Hubble’s law, ideal gas law, etc.) and challenging characteristics (small sample sizes, non-Gaussian noise, nontrivial codomain structure). Typical sources include Feynman, multiviewSR, and extensions curated for disciplinary coverage (Aldeia et al., 6 May 2025, Aldeia, 1 Dec 2025).
  • Dummy-variable stress tests: Injected irrelevant features to assess variable-selection robustness (Matsubara et al., 2022).

Each dataset is fully numeric, free of missing values, standardized prior to training, and accompanied by domain-specific sampling protocols to preserve the authenticity of the underlying challenge.
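
To make the selection procedure described above concrete, the following is a minimal sketch, assuming the per-dataset meta-features (sample count, feature count, prior SR performance, etc.) are already collected in a pandas DataFrame. The scikit-learn calls and the choice of 12 clusters mirror the SRBench 2.0 description, but the function and column handling are illustrative, not the benchmark's actual implementation.

```python
# Sketch of SRBench 2.0-style black-box dataset selection:
# meta-feature profiling -> t-SNE embedding -> k-means clustering,
# keeping one representative dataset per cluster.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

def select_representative_datasets(meta: pd.DataFrame, n_keep: int = 12,
                                   seed: int = 0) -> list:
    """meta: one row per dataset (index = dataset name), columns = numeric
    meta-features such as n_samples, n_features, prior SR performance."""
    X = StandardScaler().fit_transform(meta.values)
    # Low-dimensional embedding of the meta-feature space.
    emb = TSNE(n_components=2, perplexity=min(30, len(meta) - 1),
               random_state=seed).fit_transform(X)
    # Cluster the embedding and keep the dataset closest to each centroid.
    km = KMeans(n_clusters=n_keep, n_init=10, random_state=seed).fit(emb)
    chosen = []
    for k, center in enumerate(km.cluster_centers_):
        members = np.where(km.labels_ == k)[0]
        dists = np.linalg.norm(emb[members] - center, axis=1)
        chosen.append(meta.index[members[np.argmin(dists)]])
    return chosen
```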

3. Algorithm Inclusion and Benchmark Protocol

Contemporary benchmarks evaluate broad algorithmic families:

| Algorithm Type | Examples | SRBench 2.0 Inclusion |
| --- | --- | --- |
| GP-based | AFP, EPLEX, PySR, GP-GOMEA, ITEA | Yes |
| Grammar-guided / deterministic | Genetic Engine, FFX, SymTree, FFX NLS | Yes |
| Deep learning / Transformer | E2E, TPSR, NeSymRes, uDSR | Yes |
| Bayesian and hybrid | BSR, AIFeynman | Yes |
| Local / iterative search | RILS-ROLS | Yes |
| Exhaustive / brute-force | Kammerer et al., Bartlett et al. | Yes |

For each method, the benchmark records: support for constant-parameter optimization, Pareto support (multiple vs. single returned solutions), time-limit compliance, hardware requirements, and Python binding status. Hyperparameter optimization uses grid search (≈4 grid points plus an “off-the-shelf” baseline), with resource limits of 6 h for tuning and 1 h for the final fit per problem, cross-validation splits, and standardized 1-CPU/10-GB-RAM hardware (Aldeia et al., 6 May 2025). A simplified sketch of this tuning loop follows.
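
The protocol can be pictured as a small, budgeted tuning loop. The sketch below is a simplified stand-in rather than SRBench's actual harness: the estimator, grid contents, and the way budgets are enforced (the real runner applies the time, CPU, and memory caps externally) are assumptions.

```python
# Illustrative per-problem protocol: a small grid plus the default config,
# cross-validated tuning under a 6 h budget, then a final fit (1 h budget,
# 1 CPU / 10 GB RAM, both enforced externally by the benchmark runner).
import time
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import cross_val_score

TUNE_BUDGET_S = 6 * 3600  # tuning wall-clock budget per problem

def tune_and_fit(estimator, param_grid, X, y, cv=5):
    """param_grid: a list of ~4 parameter dicts; {} is the off-the-shelf default."""
    start, best_score, best_params = time.time(), -np.inf, {}
    for params in [{}] + list(param_grid):
        if time.time() - start > TUNE_BUDGET_S:
            break  # respect the tuning time limit
        model = clone(estimator).set_params(**params)
        score = cross_val_score(model, X, y, cv=cv, scoring="r2").mean()
        if score > best_score:
            best_score, best_params = score, params
    # Final fit on the full training split with the best configuration.
    return clone(estimator).set_params(**best_params).fit(X, y)
```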

4. Evaluation Metrics: Accuracy, Complexity, and Energy

Precision in metric formulation is crucial for comparability; sketches of the core metric computations follow the list below:

  • Accuracy:
    • Coefficient of Determination:

    R^2 = 1 - \frac{\sum_{i=1}^n (y_i - \hat y_i)^2}{\sum_{i=1}^n (y_i - \bar y)^2}, \qquad \bar y = \frac{1}{n}\sum_{i=1}^n y_i

    • Root-Mean-Square Error:

    \mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^n (y_i - \hat y_i)^2}

  • Complexity:

    • Node Count: Number of nodes in the SymPy parse tree after conversion (no aggressive simplification) (Aldeia et al., 6 May 2025).
    • Description Length: Optional; the number of symbols in the expression's linearized (string) notation.
  • Energy Consumption:
    • Using eco2AI, report total energy (in kWh or J) and per-evaluation energy:

    E_{\mathrm{total}} = \int_{t_0}^{t_{\mathrm{end}}} P(t)\,dt, \qquad E_{\mathrm{eval}} = \frac{E_{\mathrm{total}}}{N_{\mathrm{eval}}}

    with P(t) the instantaneous power and N_{\mathrm{eval}} the number of fitness evaluations (Aldeia et al., 6 May 2025).

  • Symbolic Accuracy (Ground-Truth Benchmarks):

    • Exact match or algebraic equivalence, checked via SymPy or against curated sets of acceptable expressions (Martinek, 20 Aug 2025).
    • Tree Edit Distance (TED): the minimum number of insertions, deletions, and substitutions needed to transform one expression tree into another (normalized as NED, as in (Matsubara et al., 2022)).
  • Pareto Front & Performance Profiles:
    • Pareto front of error vs. node count, with ground-truth expressions marked for first-principles tasks.
    • Distribution-aware plots (performance profiles, AUC across R^2 thresholds) to summarize multi-run outcomes and robustness (Aldeia et al., 6 May 2025).
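
The accuracy, complexity, and energy metrics above translate directly into short helpers. The sketch below uses scikit-learn for R^2 and RMSE and SymPy's parse tree for node count, and treats energy per evaluation as a simple ratio of quantities reported by a tracker such as eco2AI; it is illustrative rather than the benchmark's own code.

```python
# Minimal implementations of the core metrics: R^2, RMSE,
# SymPy node count, and energy per fitness evaluation.
import numpy as np
import sympy as sp
from sklearn.metrics import r2_score, mean_squared_error

def r_squared(y_true, y_pred):
    return float(r2_score(y_true, y_pred))

def rmse(y_true, y_pred):
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

def node_count(expr_str):
    """Number of nodes in the SymPy parse tree (no aggressive simplification)."""
    expr = sp.sympify(expr_str)
    return sum(1 for _ in sp.preorder_traversal(expr))

def energy_per_eval(e_total, n_evals):
    """E_eval = E_total / N_eval, with E_total as reported e.g. by eco2AI."""
    return e_total / n_evals

# Example: operators, symbols, and constants all count as nodes.
print(node_count("2.5*x0**2 + sin(x1)"))
```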
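
For the ground-truth track, symbolic accuracy can be approximated by algebraic equivalence; a hedged sketch using SymPy simplification follows. Curated acceptable-expression sets and tree edit distance, as used in the cited benchmarks, extend this basic check and are not reproduced here.

```python
# Symbolic-accuracy check: two expressions count as equivalent if their
# difference simplifies to zero (limited by SymPy's ability to prove it).
import sympy as sp

def symbolically_equivalent(candidate, ground_truth):
    diff = sp.simplify(sp.sympify(candidate) - sp.sympify(ground_truth))
    return bool(diff == 0 or diff.equals(0))

print(symbolically_equivalent("sin(x)**2 + cos(x)**2", "1"))  # True
print(symbolically_equivalent("x*(x + 1)", "x**2 + x"))       # True
print(symbolically_equivalent("x**2 + x", "x**2"))            # False
```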

5. Methodological Innovations and Best-Practice Principles

Recent research articulates several methodological advances:

  • Curated Acceptable Expression Sets: FastSRB (Martinek, 20 Aug 2025) introduces checklists of functionally equivalent expressions and early-termination callbacks, substantially raising rediscovery rates while saving computational expense; the callback idea is sketched after this list.
  • Dynamic and Multi-objective Selection: Dynamic ε-lexicase selection and NSGA-II are advocated for building robust Pareto fronts and avoiding overfitting in multi-criterion settings (Aldeia, 1 Dec 2025); a selection sketch also follows this list.
  • Hyperparameter Standardization: Small, curated grids and built-in tuning mechanisms (e.g., Optuna, multi-armed bandit-based approaches in PySR, Operon) ensure both fairness and practical usability (Aldeia et al., 6 May 2025).
  • Deprecation Protocols: Inactivity and chronic Pareto inferiority are formal criteria for method exclusion from living benchmarks.
  • Energy Efficiency: Early stopping, reduced Python-level overhead, and resource capping are now integral to best-practice guidance (Aldeia et al., 6 May 2025).
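
The curated-set early-termination idea can be sketched as a simple callback that halts the search once a candidate matches any acceptable expression up to algebraic equivalence. The class and its interface below are hypothetical and do not reproduce FastSRB's actual API.

```python
# Hypothetical early-termination callback: stop the SR run once a candidate
# matches any curated acceptable expression (up to SymPy equivalence).
import sympy as sp

class EarlyStopOnRediscovery:
    def __init__(self, acceptable_exprs):
        self.targets = [sp.sympify(e) for e in acceptable_exprs]
        self.solved = False

    def __call__(self, candidate_expr):
        """Return True if the search should terminate."""
        cand = sp.sympify(candidate_expr)
        for target in self.targets:
            if sp.simplify(cand - target) == 0:
                self.solved = True
                return True
        return False

# Usage inside a (hypothetical) search loop:
callback = EarlyStopOnRediscovery(["x0*x1"])
for candidate in ["x0 + x1", "x1*x0"]:  # stand-in for search output
    if callback(candidate):
        break
```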
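
Dynamic ε-lexicase selection, referenced in the second item above, can likewise be sketched for a single parent selection. This follows the published algorithm only in outline (cases shuffled per selection event, pool filtered case by case with ε set to the median absolute deviation of the pool's errors on that case) and is not taken from any benchmark's code.

```python
# Sketch of dynamic epsilon-lexicase selection for one parent selection.
# errors: array of shape (n_individuals, n_cases) with per-case absolute errors.
import numpy as np

def epsilon_lexicase_select(errors, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    pool = np.arange(errors.shape[0])           # candidate individual indices
    for c in rng.permutation(errors.shape[1]):  # random case order per event
        case_err = errors[pool, c]
        # epsilon = median absolute deviation of this case's errors in the pool
        eps = np.median(np.abs(case_err - np.median(case_err)))
        pool = pool[case_err <= case_err.min() + eps]
        if len(pool) == 1:
            break
    return int(rng.choice(pool))

# Example: 5 individuals evaluated on 8 training cases.
errs = np.abs(np.random.default_rng(0).normal(size=(5, 8)))
parent = epsilon_lexicase_select(errs)
```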

6. Critical Assessment, Trade-offs, and Limitations

Rigorous meta-analyses and comparative studies establish key insights:

  • No Universal Winner: On both black-box and first-principles tasks, no single algorithm dominates in both accuracy and interpretability across all datasets. Methods vary in performance by dataset class, dimensionality, and noise structure (Aldeia et al., 6 May 2025, Aldeia, 1 Dec 2025).
  • Multi-objective Efficacy: Pareto optimization and dynamic selection yield models closer to the efficiency frontier, revealing nuanced trade-offs in model accuracy vs. complexity (Aldeia, 1 Dec 2025).
  • Robustness to Formulation: Benchmarks with curated equivalence (FastSRB), dummy variable injection, and NED metrics expose method weaknesses in algebraic over-constraining or spurious variable inclusion (Martinek, 20 Aug 2025, Matsubara et al., 2022).
  • Dependence on Resource Limits: Variability in runtime, memory, and energy usage can yield divergent outcomes in practical settings; enforcing strict, documented constraints is now standard.
  • Metric Biases: Exclusive reliance on R^2 or complexity can obscure symbolic correctness; combined error, NED, and size analysis is essential (Reis et al., 2024).
  • Domain-specific Relevance: Generic, “toy” benchmarks may poorly predict performance on real-world or scientific problems (e.g., in astrophysics, physical law recovery; see cp3-bench findings (Thing et al., 2024)).

7. Outlook: Living Benchmarks and Future Directions

The field converges toward several consensus directions:

  • Continual Benchmark Curation: Periodic expert-led pruning and expansion, using community-driven APIs and containerized environments, is required to maintain relevance and reproducibility (Aldeia et al., 6 May 2025).
  • Domain-aware Expansion: Including phenomenologically grounded data, variable noise levels, and semantic metadata ensures broad applicability.
  • Interpretability and Scientific Utility: Explanatory analyses (feature importance and post-hoc explainers such as Partial Effects and SHAP (Aldeia et al., 2024)) are now benchmarked in tandem with model discovery.
  • Sustainability and Green AI: Energy tracking and the pursuit of efficient algorithmic implementations are now standard research objectives (Aldeia et al., 6 May 2025).

In summary, symbolic regression benchmarking has matured into a scientific subdiscipline, emphasizing rigor in dataset diversity, algorithmic inclusion, metric design, and reproducibility infrastructure. SRBench 2.0 and related frameworks set community standards for evaluating and evolving SR algorithmic landscapes, while recent methodological innovations ensure these benchmarks remain reflective of both research advances and practical deployment constraints (Aldeia et al., 6 May 2025, Aldeia, 1 Dec 2025, Martinek, 20 Aug 2025).
