PySR: Symbolic Regression for Scientific Discovery

Updated 30 August 2025
  • PySR is an open-source symbolic regression framework that uses evolutionary algorithms to discover interpretable, high-fidelity scientific equations.
  • It integrates domain-specific constraints with advanced computational techniques like JIT compilation and symbolic differentiation to optimize model accuracy and interpretability.
  • PySR recovers empirical laws from noisy data and demonstrates robust performance in fields such as astrophysics, geophysics, and high-performance engineering.

PySR is an open-source symbolic regression framework designed for scientific equation discovery. Leveraging a multi-population evolutionary algorithm and integrated with the Julia-based SymbolicRegression.jl backend, PySR seeks interpretable, high-fidelity analytic expressions that facilitate physical understanding and practical deployment. Its algorithmic architecture is suited to extracting empirical laws from data, integrating domain knowledge through constraints, and enabling robust model discovery under noisy, complex conditions.

1. Algorithmic Foundation and Core Workflow

PySR operates by representing candidate models as expression trees, searching the vast combinatorial space of algebraic forms—numbers, variables, mathematical operators—through parallel, multi-population ("island model") evolutionary algorithms (Cranmer, 2023).
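Concretely, a candidate model can be pictured as a small expression tree. The sketch below is an illustrative Python analogue (PySR's actual node structures live in the Julia backend): it evaluates such a tree and measures its complexity as a node count, the usual parsimony measure.

```python
import math

# Illustrative expression-tree representation (not PySR's internal type).
# A tree is a nested tuple: ("op", child, ...), ("var", index), or ("const", value).
OPS = {"+": lambda a, b: a + b,
       "*": lambda a, b: a * b,
       "cos": math.cos}

def evaluate(tree, x):
    kind = tree[0]
    if kind == "var":
        return x[tree[1]]
    if kind == "const":
        return tree[1]
    args = [evaluate(child, x) for child in tree[1:]]
    return OPS[kind](*args)

def complexity(tree):
    # One unit of complexity per node in the tree.
    if tree[0] in ("var", "const"):
        return 1
    return 1 + sum(complexity(child) for child in tree[1:])

# Encodes f(x0) = 2.5 * cos(x0) + x0
tree = ("+", ("*", ("const", 2.5), ("cos", ("var", 0))), ("var", 0))
```

Mutation and crossover then act directly on these trees: replacing a subtree, perturbing a constant, or swapping branches between two candidates.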

Population Evolution and Candidate Selection

Each population evolves asynchronously, undergoing mutation, crossover, and selection based on a composite scoring function. This function combines prediction loss (e.g., an \ell_2 norm on residuals) with complexity penalties, often adaptively scaled via frequency-recurrence ("frecency") schemes to encourage exploration of simple models and avoid local minima:

\ell(E) = \ell_{\text{pred}}(E) \times \exp[\text{frecency}(C(E))]

Simulated annealing is integrated into the acceptance criterion to favor diversity, and cross-population migration periodically injects high-performing individuals for global exploration. After evolutionary steps, candidates are algebraically simplified and constants optimized (e.g., via BFGS using Optim.jl).
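The scoring and acceptance logic above can be sketched as follows. This is a toy stand-in, not PySR's implementation: the `frequency` table and `parsimony` scaling here are assumptions standing in for the adaptive frecency scheme.

```python
import math
import random

def score(pred_loss, complexity, freq_by_complexity, parsimony=0.1):
    # ℓ(E) = ℓ_pred(E) * exp(frecency(C(E))): candidates at over-visited
    # complexity levels are penalized multiplicatively.
    return pred_loss * math.exp(parsimony * freq_by_complexity.get(complexity, 0))

def accept(old_score, new_score, temperature, rng=random):
    # Simulated-annealing acceptance: always take improvements; take worse
    # candidates with probability exp(-(ℓ_new - ℓ_old) / T) to keep diversity.
    if new_score <= old_score:
        return True
    return rng.random() < math.exp(-(new_score - old_score) / temperature)
```

At high temperature nearly any candidate is accepted, preserving diversity; as the temperature anneals, the search concentrates on low-scoring expressions.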

Julia Backend: SymbolicRegression.jl

SymbolicRegression.jl provides high-performance computation via JIT compilation, SIMD kernel fusion, automatic differentiation, and distributed parallelization across multicore or cluster resources. User-defined mathematical operators are compiled to machine code; expression evaluation and constant optimization proceed at near C/C++ speeds.

Python Frontend

A scikit-learn-like interface democratizes access for scientific practitioners. Custom loss functions, domain-specific operators, or constraints are readily incorporated via the Python layer.
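A typical invocation looks like the sketch below. It is hedged: the argument names (`niterations`, `binary_operators`, `extra_sympy_mappings`, `model_selection`) reflect common PySRRegressor options and may vary across versions, and running it requires a local PySR/Julia installation.

```python
from pysr import PySRRegressor

model = PySRRegressor(
    niterations=40,
    binary_operators=["+", "-", "*", "/"],
    unary_operators=["cos", "exp", "square(x) = x^2"],  # custom Julia operator
    extra_sympy_mappings={"square": lambda x: x**2},    # its SymPy counterpart
    model_selection="best",  # pick from the accuracy/complexity Pareto front
)
# model.fit(X, y); print(model.sympy())  # usual scikit-learn fit/inspect cycle
```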

2. Integration of Background Knowledge and Constraints

A distinguishing capability of PySR is its support for embedding physical or theoretical constraints directly in the regression process (Fox et al., 2023). This is realized via soft penalties in the scoring function that steer the search toward physically meaningful solutions:

For thermodynamic constraints (e.g., isotherm models), requirements such as

  • \lim_{p \to 0} f(p) = 0
  • \lim_{p \to 0} \frac{df}{dp} < \infty
  • f'(p) \geq 0 for p > 0

are encoded as Boolean functions using a computer algebra system (SymPy). If an expression violates a constraint, the data-based loss is multiplied by a penalty factor c_i > 1:

L = \ell_{\text{data}} \times \prod_{i=1}^{N} c_i^{\delta_i}

where \delta_i = 1 if constraint i is violated and 0 otherwise.

This approach guides the genetic search, penalizing candidates that fail to obey expected physics but still allowing exploration by avoiding "hard" constraint rejection. Algorithmic handling of monotonicity or limit behavior is explicitly implemented (e.g., by domain partition and symbolic differentiation).
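The multiplicative penalty can be sketched with a numerical stand-in for the constraint checks (the actual implementation verifies limits and derivative signs symbolically via SymPy; the sampling grid, tolerances, and penalty factors below are illustrative assumptions).

```python
def violates_zero_limit(f, eps=1e-6, tol=1e-3):
    # Constraint: lim_{p -> 0} f(p) = 0, checked near the origin.
    return abs(f(eps)) > tol

def violates_monotonicity(f, grid, h=1e-6):
    # Constraint: f'(p) >= 0 for p > 0, checked by finite differences.
    return any((f(p + h) - f(p)) / h < 0 for p in grid)

def penalized_loss(data_loss, f, penalties=(5.0, 5.0)):
    # L = ℓ_data * Π c_i^{δ_i}: each violated constraint multiplies the loss.
    grid = [0.1 * k for k in range(1, 50)]
    deltas = [violates_zero_limit(f), violates_monotonicity(f, grid)]
    loss = data_loss
    for c, delta in zip(penalties, deltas):
        if delta:
            loss *= c
    return loss

langmuir = lambda p: 2.0 * p / (1.0 + 0.5 * p)  # satisfies both constraints
bad      = lambda p: 1.0 - p                    # violates both constraints
```

A physically admissible isotherm keeps its data loss unchanged, while a violating candidate is pushed down the ranking without being discarded outright.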

3. Benchmarking and Performance

PySR is evaluated against other symbolic regression approaches (AI-Feynman, GPLearn, PySINDy, SymbolicGPT, Operon, etc.) in diverse scientific contexts:

EmpiricalBench

EmpiricalBench is a curated benchmark that quantifies the ability to recover historical empirical equations from noisy, complex datasets typical of the sciences (Cranmer, 2023). PySR demonstrates robust recovery of known functional forms (e.g., Kepler's law, Hubble law) and non-integral constants, balancing accuracy and interpretability across Pareto fronts.

Dynamical Systems Recovery

PySR has consistently achieved high predictive power and accuracy in uncovering governing equations of dynamical systems, such as the Lorenz attractor, nonlinear pendulum, Lotka–Volterra, and compartmental epidemic models (Brum et al., 27 Aug 2025). In many cases, symbolic forms learned by PySR are statistically indistinguishable from canonical analytical models by Wilcoxon signed-rank tests.

Domain-Specific Applications

Applications include:

  • Cosmology: Model-independent reconstruction of the dark energy equation of state w(z) from DESI-DR2 BAO and supernova data, confirming consistency with \Lambda CDM (Sousa-Neto et al., 14 Feb 2025).
  • QCD: Analytic parameterization of generalized parton distributions H_{u-d}(x, t), testing factorization hypotheses and extracting hadronic spatial distributions (Dotson et al., 17 Apr 2025).
  • Geophysics: Derivation of Dst evolution equations in geomagnetic storms, capturing nonlinear dependencies and threshold effects and outperforming traditional empirical models (Markidis et al., 25 Apr 2025).
  • Multi-messenger astrophysics: Accurate neutron star radius estimation from GW-only observables (mass, tidal deformability) with EOS-independence, matching full TOV solutions to within hundreds of meters (Bejger, 28 Apr 2025).
  • Galaxy evolution: Star formation rate parameterizations linking gas density, turbulence, and stellar surface density, recovering scaling-relation-like laws in the Kennicutt–Schmidt plane (Salim et al., 7 May 2025).
  • Particle physics: Extraction of compact, analytic angular observables for precision LHC datasets, interpolating angular coefficients over multidimensional kinematic spaces (Bendavid et al., 1 Aug 2025).

4. Model Selection, Loss Functions, and Interpretability

PySR's multi-objective optimization balances prediction fidelity and parsimony. Selection is performed over the Pareto front, employing scoring metrics that reward substantial loss reduction per complexity increment (e.g., -\frac{\partial \log L}{\partial c}).
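A discrete version of this selection rule can be sketched as follows: walk up the Pareto front and score each step by the drop in log-loss per unit of added complexity (an illustrative sketch of the criterion, not PySR's exact implementation).

```python
import math

def select_best(front):
    # front: list of (complexity, loss) pairs sorted by increasing complexity,
    # with strictly decreasing loss (a Pareto front).
    best, best_score = front[0], 0.0
    for (c0, l0), (c1, l1) in zip(front, front[1:]):
        # Discrete analogue of -d(log L)/dc between adjacent front members.
        step_score = (math.log(l0) - math.log(l1)) / (c1 - c0)
        if step_score > best_score:
            best, best_score = (c1, l1), step_score
    return best

front = [(1, 1.0), (3, 0.5), (5, 0.01), (9, 0.009)]
```

Here the jump from complexity 3 to 5 buys a fifty-fold loss reduction, so that expression is preferred over the marginal improvement at complexity 9.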

The loss framework can be tailored:

  • \ell_2 or \ell_1 losses for regression problems.
  • Custom quantile losses for distributional modeling (see Symbolic Quantile Regression, SQR (Hoekstra et al., 11 Aug 2025)).
  • Domain-informed composite losses combining prediction error and physics-based soft constraints.
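A quantile (pinball) loss of the kind used for symbolic quantile regression can be written as a simple elementwise loss (a sketch of the standard pinball loss; SQR's exact formulation may differ in detail):

```python
def pinball_loss(y_true, y_pred, tau):
    # Pinball loss for the tau-th conditional quantile: under-predictions are
    # weighted by tau, over-predictions by (1 - tau), so minimizing the mean
    # loss drives y_pred toward the tau-quantile of y_true.
    err = y_true - y_pred
    return tau * err if err >= 0 else (tau - 1.0) * err
```

For tau = 0.9, under-predicting by one unit costs 0.9 while over-predicting by one unit costs only 0.1, biasing the fitted expression toward the upper tail.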

Final expressions are displayed as human-interpretable formulas, facilitating direct inspection and theoretical validation. This transparency distinguishes PySR from black-box ML architectures, supporting controlled extrapolation and physical insight.

5. Applications in High-Performance, Resource-Constrained Environments

PySR enables deployment beyond standard scientific analysis:

FPGA and Real-Time Inference

Integration with hls4ml allows PySR-generated symbolic models to be synthesized as firmware for FPGAs (Field-Programmable Gate Arrays), achieving up to a 13-fold reduction in inference latency (e.g., down to 5 ns) while retaining >90% of baseline neural-network accuracy (Tsoi et al., 2023). Resource usage is minimized through direct control of operator complexity, with support for latency-aware training (LAT) and lookup-table (LUT) approximation of expensive functions.
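The LUT idea can be illustrated in software (an illustrative sketch, not hls4ml's implementation; table size and range are arbitrary assumptions): precompute an expensive function on a uniform grid, then replace each call with a linear interpolation between table entries.

```python
import math

class Lut:
    """Lookup-table approximation of a 1-D function on [lo, hi]."""

    def __init__(self, fn, lo, hi, n=256):
        self.lo, self.step = lo, (hi - lo) / (n - 1)
        # Precompute the function once on a uniform grid.
        self.table = [fn(lo + i * self.step) for i in range(n)]

    def __call__(self, x):
        # Locate the bracketing grid cell and interpolate linearly.
        i = min(max(int((x - self.lo) / self.step), 0), len(self.table) - 2)
        frac = (x - self.lo) / self.step - i
        return self.table[i] * (1 - frac) + self.table[i + 1] * frac

exp_lut = Lut(math.exp, -4.0, 4.0)
```

On hardware the table lives in block RAM and the interpolation reduces to a multiply-add, trading a small, bounded approximation error for deterministic low latency.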

Automated PDE Discovery

The ANN-PYSR framework combines PySR with residual attention neural networks to robustly identify governing PDEs from sparse and highly noisy data, providing a nearly two-order-of-magnitude computational speedup and low parameter errors at noise levels up to 200% (Zhang et al., 22 Jun 2025). The extracted symbolic expressions remain interpretable and reveal underlying physical principles even in challenging sensor-network scenarios.
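The core PDE-identification step can be shown in miniature (a toy illustration, not the ANN-PYSR pipeline): sample the exact heat-equation solution u(x, t) = e^{-t} sin(x), estimate u_t and u_xx by central finite differences, and least-squares fit the single coefficient a in u_t = a u_xx, which should recover a = 1.

```python
import math

# Exact solution of the 1-D heat equation u_t = u_xx.
u = lambda x, t: math.exp(-t) * math.sin(x)
h = 1e-4  # finite-difference step

samples = [(0.1 * k, 0.2) for k in range(1, 30)]
# Central differences for the time derivative and second space derivative.
ut  = [(u(x, t + h) - u(x, t - h)) / (2 * h) for x, t in samples]
uxx = [(u(x + h, t) - 2 * u(x, t) + u(x - h, t)) / h**2 for x, t in samples]

# One-parameter least squares: a = <ut, uxx> / <uxx, uxx>.
a = sum(p * q for p, q in zip(ut, uxx)) / sum(q * q for q in uxx)
```

In realistic settings the candidate library contains many terms and the data are noisy, which is where the neural denoising stage and PySR's symbolic search earn their keep.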

6. Extensions and Recent Developments

PySR serves as the baseline for extended frameworks:

  • LaSR (Learning with a Symbolic Regression framework) integrates zero-shot LLM prompts and a learned concept library to bias genetic search toward abstract, high-quality solutions, outperforming standard PySR on Feynman benchmarks (Grayeli et al., 14 Sep 2024).
  • eggp adapts genetic operators to exploit e-graph memory, storing all equivalence classes of visited expressions, thus efficiently suppressing redundancy and showing competitive performance with PySR and Operon in SRBench and real-world datasets (Franca et al., 29 Jan 2025).
  • SQR generalizes PySR to allow direct quantile regression, producing symbolic models for conditional quantiles in high-stakes domains (e.g., airline fuel consumption) that match black-box model accuracy while remaining interpretable (Hoekstra et al., 11 Aug 2025).

7. Computational Trade-offs and Limitations

The integration of background knowledge and computer-algebra constraint checking introduces notable computational overhead, increasing runtime by roughly an order of magnitude in some regimes (Fox et al., 2023). Complex operator sets, larger population sizes, and high-dimensional data further challenge scalability, necessitating judicious parameter selection and distributed computing.

Hard constraints can overly restrict the search space, while poorly chosen soft constraints may hinder effective exploration. PySR remains robust for moderate expression complexity and population sizes, but alternative frameworks (e.g., Operon and SymbolNet) may exhibit superior scaling or specialized performance in high-dimensional or vectorized input contexts.

8. Impact and Summary

PySR has established itself as a robust, extensible tool for equation discovery, model parameterization, and scientific machine learning. Its algorithmic innovations—multi-population evolutionary search, adaptive loss penalties, domain constraint integration, and high-performance Julia backend—support its widespread use across physical sciences, engineering, and data-driven modeling. The propensity for interpretable analytic output is central to its impact, enabling transparent model inspection, hypothesis testing, and rapid deployment in constrained environments. Ongoing research continues to develop new extensions, integrate LLM guidance, and expand PySR's capabilities for automated scientific discovery.