On the Generalization Bounds of Symbolic Regression with Genetic Programming
Published 19 Apr 2026 in cs.LG and cs.NE | (2604.17402v1)
Abstract: Symbolic regression (SR) with genetic programming (GP) aims to discover interpretable mathematical expressions directly from data. Despite its strong empirical success, the theoretical understanding of why GP-based SR generalizes beyond the training data remains limited. In this work, we provide a learning-theoretic analysis of SR models represented as expression trees. We derive a generalization bound for GP-style SR under constraints on tree size, depth, and learnable constants. Our result decomposes the generalization gap into two interpretable components: a structure-selection term, reflecting the combinatorial complexity of choosing an expression-tree structure, and a constant-fitting term, capturing the complexity of optimizing numerical constants within a fixed structure. This decomposition provides a theoretical perspective on several widely used practices in GP, including parsimony pressure, depth limits, numerically stable operators, and interval arithmetic. In particular, our analysis shows how structural restrictions reduce hypothesis-class growth while stability mechanisms control the sensitivity of predictions to parameter perturbations. By linking these practical design choices to explicit complexity terms in the generalization bound, our work offers a principled explanation for commonly observed empirical behaviors in GP-based SR and contributes towards a more rigorous understanding of its generalization properties.
The paper presents a generalization bound for symbolic regression models, decomposing the impact of tree structure and parameter optimization.
It mathematically quantifies how constraints on tree size, depth, and constant sensitivity help control overfitting.
The results justify the practical effectiveness of parsimony pressure, depth limiting, and numerical stability in GP-based symbolic regression.
Generalization Bounds for Symbolic Regression with Genetic Programming
Overview
This paper presents a formal statistical learning analysis of symbolic regression (SR) conducted via Genetic Programming (GP). The central contribution is a generalization bound for SR models explicitly represented as expression trees, controlled by tree size, depth, and the treatment of learnable constants. The analysis decomposes the generalization gap into two terms: one reflecting combinatorial complexity from structure selection, and the other capturing sensitivity to the optimization of numerical constants. This decomposition rigorously explains why commonly-used mechanisms in GP—such as parsimony pressure, depth limiting, numerically stable operators, and interval arithmetic—are effective at restraining overfitting and promoting generalization.
Expression Tree Model and Complexity Controls
The GP-based SR paradigm is described in terms of predictors parameterized by expression trees. Each tree comprises operators as internal nodes and variables or constants as leaves, with an explicit partition between fixed and learnable constants. To focus on hypothesis classes with finite VC dimension and controllable complexity, both the size and depth of trees are upper bounded (s and D, respectively), and the ℓ2-norm of the learnable constants is constrained within a radius R.
The full hypothesis class thus considered is the union over all admissible tree topologies (under size and depth budgets), with parameters constrained to the prescribed ball. This sets the stage for a complexity analysis sensitive to both discrete (symbolic) and continuous (parametric) sources of capacity.
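To make the setup concrete, here is a minimal, hypothetical sketch (not the paper's code) of such a hypothesis class: an expression tree with operators of arity at most 2 at internal nodes, variables and learnable constants at leaves, and an admissibility check enforcing the size, depth, and ℓ2-radius constraints. All names and the small operator set are illustrative choices.

```python
# Minimal, hypothetical sketch of the constrained hypothesis class:
# expression trees with size <= s_max, depth <= d_max, and learnable
# constants theta restricted to an l2-ball of radius R.
import numpy as np

class Node:
    def __init__(self, symbol, children=(), const_index=None):
        self.symbol = symbol            # operator, "x<j>" for a variable, or "const"
        self.children = list(children)  # empty for leaves
        self.const_index = const_index  # index into theta for learnable constants

    def size(self):
        return 1 + sum(c.size() for c in self.children)

    def depth(self):
        return 1 + max((c.depth() for c in self.children), default=0)

def admissible(tree, theta, s_max, d_max, radius):
    """Structural and parametric constraints defining the hypothesis class."""
    return (tree.size() <= s_max
            and tree.depth() <= d_max
            and np.linalg.norm(theta) <= radius)

def evaluate(tree, x, theta):
    """Evaluate an expression tree at input vector x with constant vector theta."""
    if tree.symbol == "const":
        return theta[tree.const_index]
    if tree.symbol.startswith("x"):
        return x[int(tree.symbol[1:])]
    args = [evaluate(c, x, theta) for c in tree.children]
    if tree.symbol == "add":
        return args[0] + args[1]
    if tree.symbol == "mul":
        return args[0] * args[1]
    if tree.symbol == "sin":
        return np.sin(args[0])
    raise ValueError(f"unknown symbol: {tree.symbol}")
```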
Main Theoretical Results
The paper's central theorem establishes a generalization bound that holds uniformly over all models in the considered class: with high probability over an i.i.d. sample S of size m drawn from the data distribution, every f in the class satisfies a bound built from three terms, sketched below.
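The following schematic is assembled from the term descriptions in this summary rather than copied from the paper; the exact constants, exponents, and logarithmic factors in the paper's statement may differ. Here s and D are the size and depth budgets, R the radius of the constant ball, G the sensitivity (Lipschitz) constant, T_{s,D} the set of admissible tree structures, and δ the confidence parameter.

```latex
% Schematic three-term bound (paraphrase of the summary above, not the
% paper's exact statement); amsmath recommended for \tfrac and \widehat.
\[
  \mathcal{L}_{\mathcal{D}}(f) \;\le\; \widehat{\mathcal{L}}_{S}(f)
  \;+\; \underbrace{\tilde{O}\!\left(\frac{R\,G\,\sqrt{s}}{\sqrt{m}}\right)}_{\text{constant fitting}}
  \;+\; \underbrace{O\!\left(\sqrt{\frac{\log |\mathcal{T}_{s,D}|}{m}}\right)}_{\text{structure selection}}
  \;+\; \underbrace{O\!\left(\sqrt{\frac{\log(1/\delta)}{m}}\right)}_{\text{confidence}}
\]
```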
The first term quantifies the cost of fitting the learnable constants; it grows with the tree-size budget s, with a Lipschitz-type sensitivity parameter G reflecting how strongly predictions respond to parameter perturbations, and with the constant-norm radius R.
The second term quantifies the combinatorial burden of structure selection and is governed by the logarithm of the number of admissible tree structures under the size and depth budgets.
The analysis yields explicit asymptotics for this structure count in the regime of trees with arity at most 2, showing that depth constraints sharply reduce hypothesis-class growth, since the count of such trees grows exponentially with an explicit base.
The third term is the standard confidence adjustment in the failure probability δ.
These results rest on the foundational tools of empirical Rademacher complexity and Dudley’s entropy integral, with union bounds applied over the finite set of possible tree structures.
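To make the role of the union bound concrete, the following is a sketch of the standard argument with generic constants from the usual empirical Rademacher-complexity bound, not necessarily those of the paper: a per-structure bound is applied to each fixed tree topology T, and the confidence level is split across the finite set of admissible structures.

```latex
% Sketch of the standard union-bound step (generic constants; paraphrase).
% F_T is the class of predictors with fixed structure T and constants in the
% radius-R ball; T_{s,D} is the set of admissible structures; amssymb needed
% for \mathfrak.
\[
  \Pr\Bigl[\,\exists\, T \in \mathcal{T}_{s,D},\ \exists f \in \mathcal{F}_{T}:\;
    \mathcal{L}_{\mathcal{D}}(f) \;>\; \widehat{\mathcal{L}}_{S}(f)
      + 2\,\widehat{\mathfrak{R}}_{S}(\mathcal{F}_{T})
      + 3\sqrt{\frac{\log\bigl(2\,|\mathcal{T}_{s,D}|/\delta\bigr)}{2m}}
  \Bigr] \;\le\; \delta .
\]
```

The price of selecting a structure thus enters only through the logarithm of the structure count, which is exactly the quantity that size and depth budgets control.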
Practical Implications for Design Heuristics
The explicit decomposition allows direct mapping between practical SR strategies and their theoretical justification.
Parsimony Pressure and Depth Limiting: Structural capacity is tightly linked to the number of admissible tree structures. Mechanisms that penalize large or deep trees directly shrink the dominant structure-selection contribution, lowering overfitting risk. The bound quantifies the algorithmic incentive for bloat control and the effect of imposing strict maximum depths; a minimal code sketch of these heuristics appears after this list.
Numerical Stability and Sensitivity Control: The constant-fitting term exposes the risk of parameter overfitting, particularly when the sensitivity constant G is large. Protected operators, interval arithmetic, and careful management of parameter scaling all act to keep G small, thereby promoting robustness.
Local Search of Constants: Aggressive fitting of numerical constants may yield low empirical error but increases model sensitivity, which again appears as growth in the constant-fitting term of the bound.
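As an illustration of how these heuristics look in code, below is a minimal, hypothetical sketch (not from the paper): a protected division operator that stays bounded near zero denominators, and a fitness function that adds explicit size and depth penalties as parsimony pressure. The penalty weights lam_size and lam_depth are illustrative placeholders.

```python
# Minimal, hypothetical sketch of two heuristics discussed above: a protected
# division operator (sensitivity control) and a parsimony-penalized fitness.
import numpy as np

def protected_div(a, b, eps=1e-6):
    """Division that returns 1.0 when |b| is tiny, keeping outputs bounded."""
    b_safe = np.where(np.abs(b) > eps, b, 1.0)
    return np.where(np.abs(b) > eps, a / b_safe, 1.0)

def penalized_fitness(predict, tree_size, tree_depth, X, y,
                      lam_size=1e-3, lam_depth=1e-2):
    """Training MSE plus explicit size and depth penalties (parsimony pressure).

    `predict` maps an input row to a prediction; `tree_size` and `tree_depth`
    are the candidate's structural measurements. The penalties push the search
    toward structures whose log-count term in the bound is smaller.
    """
    preds = np.array([predict(x) for x in X])
    mse = float(np.mean((preds - y) ** 2))
    return mse + lam_size * tree_size + lam_depth * tree_depth
```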
These connections provide a formal explanation for many empirical regularities in GP-based SR observed over decades of practice, such as the well-documented efficacy of parsimony pressure and the dangers of excessive structure or parameter complexity.
Theoretical Implications and Future Directions
The analysis is uniform and agnostic to the data, so while it provides qualitative guidance and worst-case guarantees, it may be loose in practice. The explicit decomposition nonetheless forms a basis for data-dependent refinements—such as empirical complexity measures or PAC-Bayesian approaches—that could yield tighter, sample-aware bounds.
There are concrete algorithmic implications:
Generalization-Aware Objective Functions: The derived bound may be used directly as part of model selection or as a regularizer, paralleling approaches using VC dimension or norm-based regularization in other machine learning subfields (a minimal sketch of such a score appears after this list).
Guided Search and Complexity Control: Structural and sensitivity terms can be quantitatively evaluated during the evolutionary process, facilitating complexity-aware model selection and automatic adjustment of budgets for size, depth, and parameter range.
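As an illustration of the model-selection idea above, the following hypothetical sketch scores a candidate by training error plus stand-ins for the structure-selection and constant-fitting terms. The crude tree-count estimate, the sensitivity estimate G_hat, and all constants are illustrative placeholders, not quantities prescribed by the paper.

```python
# Hypothetical sketch: score a candidate by training error plus stand-ins for
# the structure-selection and constant-fitting terms of the bound.
import math

def bound_score(train_error, m, size, radius, G_hat, n_symbols, delta=0.05):
    # Crude upper bound on log(#trees with <= `size` nodes over `n_symbols`
    # operators/leaves): `size` label choices plus a Catalan-type shape factor.
    log_tree_count = size * math.log(max(n_symbols, 2)) + 2.0 * size
    structure_term = math.sqrt(log_tree_count / m)
    # Constant-fitting stand-in: radius of the constant ball times an estimated
    # sensitivity constant, scaled by the number of constants (at most `size`).
    constant_term = radius * G_hat * math.sqrt(size / m)
    confidence_term = math.sqrt(math.log(1.0 / delta) / m)
    return train_error + constant_term + structure_term + confidence_term

# Model selection would then keep the candidate with the smallest bound_score
# rather than the smallest training error alone.
```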
The extension to stochastic or data-dependent complexity estimates opens the prospect for refined analysis and new SR methodologies that do not rely on heuristics, but on explicit control of both combinatorial and parametric generalization contributions.
Conclusion
This paper provides a rigorous decomposition of the generalization capacity of GP-based symbolic regression into structure and parameter complexity terms. The framework both explains and gives theoretical support to widely-adopted SR design choices such as parsimony pressure and the use of numerically stable operators. The introduction of explicit, interpretable bounds opens avenues for principled model selection and further theoretical advances, including data-dependent generalization guarantees and their algorithmic exploitation. These results serve as a foundation for both improved empirical practice and future theoretical developments in interpretable regression modeling.
Reference:
"On the Generalization Bounds of Symbolic Regression with Genetic Programming" (2604.17402)