SymbolicRegression.jl: Symbolic Regression in Julia

Updated 21 September 2025
  • SymbolicRegression.jl is a high-performance Julia library for symbolic regression that uses multi-population evolutionary algorithms to generate human-interpretable models.
  • It employs an evolve–simplify–optimize loop, combining genetic operators with local optimization of constants to balance accuracy against model complexity.
  • The library supports distributed computation, runtime operator fusion, and integration with Python, enhancing scalable scientific model discovery.

SymbolicRegression.jl is a high-performance, open-source Julia library for symbolic regression, implementing a range of modern search algorithms to discover compact, human-interpretable mathematical models directly from data. Positioned at the intersection of machine learning, optimization, and scientific modeling, SymbolicRegression.jl is the engine behind PySR and supports distributed computation, automatic differentiation, and runtime operator fusion. It is widely used in the sciences for empirical model discovery, benchmark development, and the recovery of interpretable closed-form laws from complex datasets.

1. Core Algorithms and Methodology

SymbolicRegression.jl primarily employs a multi-population evolutionary algorithm based on the evolve–simplify–optimize loop (Cranmer, 2023). This algorithm evolves a population of symbolic expression trees through genetic operators (mutation, crossover, reproduction), with populations distributed across worker processes for scaling to thousands of CPU cores. Each candidate expression is represented as a tree constructed from a user-configurable set of unary and binary operators (e.g., +, −, ×, /, sin, exp), constants, and variables. Search proceeds via:

  • Evolve: Randomized variations of the population through subtree crossover and mutation, with parsimony pressure via explicit complexity constraints.
  • Simplify: Syntactic (constant-folding, algebraic reductions) and semantic simplification of expressions at each step to favor compact and interpretable forms.
  • Optimize: Scalar constants within candidate expressions are numerically optimized (“constant optimization”) using classical local optimizers (e.g., Levenberg–Marquardt or BFGS) to improve data fit after symbolic manipulation.

The fitness function is a multi-objective optimization combining accuracy (e.g., mean squared error, R²) and expression complexity. Pareto fronts are constructed to allow users to select trade-offs between predictive quality and interpretability.
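
The following minimal sketch runs such a search end to end and inspects the resulting Pareto front. It assumes the package's documented `Options`/`equation_search` interface (exact function and keyword names can vary between releases); the synthetic data and operator set are purely illustrative.

```julia
using SymbolicRegression

# Synthetic data: 2 features (rows) × 100 samples (columns), Julia convention
X = randn(Float64, 2, 100)
y = 2 .* cos.(X[1, :]) .+ X[2, :] .^ 2

# Configure the operator set and a complexity constraint for parsimony
options = Options(
    binary_operators=[+, -, *, /],
    unary_operators=[cos, exp],
    maxsize=20,  # upper bound on expression-tree complexity
)

# Run the multi-population evolutionary search
hall_of_fame = equation_search(X, y; options=options, niterations=40)

# Extract the accuracy/complexity Pareto front and print its members
dominating = calculate_pareto_frontier(hall_of_fame)
for member in dominating
    complexity = compute_complexity(member.tree, options)
    println(complexity, "\t", member.loss, "\t", string_tree(member.tree, options))
end
```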

SymbolicRegression.jl supports runtime fusion of user-defined operators into SIMD kernels and can perform automatic differentiation on candidate expressions, facilitating scalable stochastic gradient-based optimization when used within differentiable programming frameworks (Cranmer, 2023). The package is also designed for seamless integration with scientific Python libraries via the PySR Python interface.
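
As a hedged illustration of operator extensibility, any scalar Julia function can be supplied as an operator and is then compiled into the fused evaluation kernels; the operator `sqm` below is a hypothetical example, not part of the package.

```julia
using SymbolicRegression

# Any scalar Julia function can serve as an operator; it should accept
# generic numeric inputs and return a scalar.
sqm(x) = x^2 - one(x)

options = Options(
    binary_operators=[+, *],
    unary_operators=[sqm],  # user-defined operator joins the search space
)

X = randn(Float64, 2, 100)
y = sqm.(X[1, :]) .+ X[2, :]

hall_of_fame = equation_search(X, y; options=options, niterations=20)
```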

2. Scientific Applications and Benchmarking

SymbolicRegression.jl has been developed with a focus on empirical scientific discovery, enabling rapid extraction of governing equations from observational or simulated data. Key use cases include:

  • Rediscovering empirical laws from the Feynman Lectures, Strogatz systems, and Livermore problems (Cranmer, 2023).
  • Materials science applications, such as extracting transformation kinetics and energy functionals directly from experiment or DFT simulations (Wang et al., 2019).
  • Industrial process modeling (e.g., distillation towers, engine operations) and failure modeling in infrastructure systems (Wang et al., 2019).
  • Physics beyond the Standard Model: mapping high-dimensional parameter spaces in supersymmetric models to key observables such as Higgs mass or relic density, achieving dramatic speedups over full simulation pipelines (AbdusSalam et al., 28 May 2024).

A standardized benchmarking workflow is provided (EmpiricalBench and extensions), featuring curated lists of functionally equivalent solutions to robustly assess rediscovery rates, along with early-termination callbacks for computational efficiency (Martinek, 20 Aug 2025).
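
A sketch of such early termination, assuming the `early_stop_condition` keyword accepted by recent versions of `Options` (the loss and complexity thresholds here are illustrative):

```julia
using SymbolicRegression

options = Options(
    binary_operators=[+, -, *, /],
    unary_operators=[sin, exp],
    # Halt the search once any expression reaches low loss at modest complexity
    early_stop_condition=(loss, complexity) -> loss < 1e-6 && complexity < 12,
)
```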

3. Algorithmic Innovations and Extensions

SymbolicRegression.jl incorporates several state-of-the-art algorithmic advancements:

  • Multi-objective Evolutionary Search: Population migration and Hall of Fame retention to maintain diversity and accuracy across clusters of CPUs (Cranmer, 2023); see the configuration sketch after this list.
  • Symbolic Constant Optimization: Decoupling the optimization of structure from that of constants, enabling rapid convergence on physically plausible formulas even in noisy datasets.
  • User-Defined Operators and SIMD Fusion: Dynamic operator dictionaries allow domain tailoring, while LLVM JIT compilation achieves high efficiency by fusing symbolic kernels.
  • Domain-Aware Symbolic Priors: Recent methods propose extracting domain-specific operator priors (from physics, chemistry, engineering corpora) and integrating them into the search via tree-structured RNNs, KL-regularized policies, and characteristic expression blocks (Huang et al., 12 Mar 2025).
  • Hybrid and MINLP Approaches: SymbolicRegression.jl can be informed by global optimization techniques such as MINLPs for globally optimal searches under moderate problem sizes (Austel et al., 2017, Kim et al., 2021). Deterministic local improvement routines also offer enhanced reproducibility (Rivero et al., 2019).
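
A configuration sketch for the distributed multi-population search, assuming the `populations`, `population_size`, `parallelism`, and `numprocs` keywords from the package documentation (exact keyword names may differ between versions):

```julia
using SymbolicRegression

options = Options(
    binary_operators=[+, -, *, /],
    populations=40,       # number of independent evolving populations
    population_size=50,   # individuals per population
)

X = randn(Float64, 3, 200)
y = X[1, :] .* X[2, :] .- X[3, :]

# Populations are distributed over worker processes and migrate between
# them; the hall of fame is shared across the whole search.
hall_of_fame = equation_search(
    X, y;
    options=options,
    niterations=40,
    parallelism=:multiprocessing,
    numprocs=4,
)
```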

Table: SymbolicRegression.jl Search Capabilities

| Feature | Description | Reference |
|---|---|---|
| Evolutionary Search | Multi-population, Pareto front, constraint-driven | (Cranmer, 2023) |
| Constant Optimization | Hybrid local FFX / BFGS on constants | (Wang et al., 2019) |
| Operator Extensibility | Arbitrary user-defined, SIMD compiled | (Cranmer, 2023) |
| Distributed Evaluation | Populations over 10³+ CPU cores | (Cranmer, 2023) |
| Domain Priors (Planned/Available) | Tree-RNN, physics-aware constraints | (Huang et al., 12 Mar 2025; AbdusSalam et al., 28 May 2024) |

4. Performance, Complexity, and Limitations

Symbolic regression is NP-hard, presenting inherent computational barriers to global optimality (Virgolin et al., 2022). SymbolicRegression.jl employs approximation strategies (evolutionary population search, parsimony regularization, and parallel computing) for practical performance on moderate- to large-scale datasets. The hybridization of heuristics with local optimization permits robust extraction of closed-form models, but global guarantees (as with MINLP) are available only in small-scale settings. Benchmark studies report a rediscovery rate of 44.7% on scientific discovery tasks (up from 26.7% with legacy metrics) and roughly 41% computational savings from early stopping via adaptive callbacks. Competing frameworks such as TiSR report higher rediscovery (69.4%) and greater time savings (63%), suggesting that further algorithmic improvement remains possible (Martinek, 20 Aug 2025).

A plausible implication is that integrating Bayesian or probabilistic post-selection (Guimera et al., 22 Jul 2025), pre-trained deep generative models exploiting algebraic invariances (Holt et al., 2023), or discrete diffusion-based generative search (Bastiani et al., 30 May 2025) could enhance both reliability and coverage on challenging rediscovery benchmarks.

5. Integration with Emerging Techniques

SymbolicRegression.jl is actively influenced by advances in deep learning and probabilistic modeling:

  • Transformer and Set-Encoder Models: Pre-training conditional sequence decoders (e.g., Transformers) on large corpora of synthetic formulas is shown to yield scalable, data-adaptive symbolic search and improved extrapolation (Biggio et al., 2021, Holt et al., 2023).
  • Diffusion and RL-based Generation: Discrete diffusion models with token-wise Group Relative Policy Optimization enable diverse, risk-seeking equation generation outperforming classical GP methods in solution rate and simplicity (Bastiani et al., 30 May 2025).
  • Bayesian Model Selection: Probabilistic post-selection based on marginal likelihood penalization (rather than explicit complexity terms) provides information-theoretic guarantees, ensemble predictions, and uncertainty quantification, with symbolic regression cast as a model-selection problem under a posterior (Guimera et al., 22 Jul 2025).
  • Domain Priors and Constraint Integration: Guidance from curated corpora via statistical operator priors and hierarchical search architectures sharply improves convergence and faithfulness to domain-expected formula classes (Huang et al., 12 Mar 2025).

These techniques open new directions for modular, hybrid workflows in SymbolicRegression.jl, where symbolic search, neural priors, hybrid local/global optimization, and ensemble uncertainty come together in a unified model discovery architecture.

6. Future Directions

Potential developments for SymbolicRegression.jl include:

  • Bayesian or probabilistic post-selection of candidate expressions, with ensemble predictions and uncertainty quantification (Guimera et al., 22 Jul 2025).
  • Pre-trained deep generative models exploiting algebraic invariances to seed or guide the evolutionary search (Holt et al., 2023).
  • Discrete diffusion-based generative search for diverse, high-quality equation proposals (Bastiani et al., 30 May 2025).
  • Domain-aware symbolic priors extracted from scientific corpora and integrated as constraints on the search (Huang et al., 12 Mar 2025).

SymbolicRegression.jl thus represents both a mature platform for interpretable machine learning in science and a rapidly evolving research testbed for algorithmic innovations in symbolic model discovery.
