Exhaustive Symbolic Regression (ESR)
- The paper demonstrates that exhaustive enumeration of candidate models up to a fixed complexity ensures global optimality within the chosen basis.
- It integrates information-theoretic, MDL-based model selection with robust, parallelized parameter fitting to overcome limitations of traditional stochastic approaches.
- Empirical results show that ESR yields interpretable, parsimonious models that can outperform conventional genetic programming in both accuracy and efficiency.
Exhaustive Symbolic Regression (ESR) is a deterministic, globally optimal framework for symbolic regression that systematically enumerates all possible analytic expressions up to a prescribed complexity threshold and ranks them using the minimum description length (MDL) principle. ESR addresses two central limitations of conventional stochastic approaches (the risk of missing optimal solutions and the ambiguity of function-selection criteria) by combining exhaustive combinatorial search with information-theoretic model selection. The methodology extends to diverse operator sets and scientific domains; empirical studies show it to be superior or complementary to traditional genetic-programming-based methods.
1. Problem Formalization and Exhaustive Search Structure
In symbolic regression, the objective is to recover an analytic function $f$ from a finite dataset $D = \{(x_i, y_i, \sigma_i)\}_{i=1}^{N}$, where $y_i$ are observed targets, $\sigma_i$ measurement uncertainties, and $f$ is drawn from expressions constructed with operators and variables from a predefined basis. ESR imposes a hard complexity budget $k_{\max}$ (maximum number of tree nodes) and systematically enumerates every possible expression tree up to size $k_{\max}$, considering all labelings of internal nodes with operators of appropriate arity.
Unique expressions are identified at two levels:
- Structural deduplication: Only tree shapes that yield valid, acyclic computation graphs are retained.
- Algebraic equivalence: For each candidate, aggressive symbolic simplification (commutativity, distributivity, parametric redefinition, and canonicalization) is employed to collapse all algebraically equivalent forms to a canonical representative; parameter permutations and integer constant simplification are incorporated.
For each unique expression, free parameters are optimized using numerical likelihood maximization, and the full list of fitted models is ranked by a composite description-length objective.
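To make the enumerate-canonicalize-deduplicate loop concrete, here is a minimal Python sketch (not the actual ESR implementation; the toy basis and helper names are invented, and real ESR canonicalization, e.g. parameter permutations and integer-constant handling, is considerably more aggressive):

```python
# Illustrative sketch of ESR-style enumeration and deduplication: build all
# expression trees up to a node budget over a toy basis, then collapse
# algebraic duplicates via sympy simplification and canonical-string hashing.
from itertools import product
import sympy as sp

x, a0 = sp.symbols("x a0")          # variable and one free parameter
NULLARY = [x, a0]                   # arity-0 basis symbols (leaves)
UNARY = [sp.exp, lambda e: -e]      # arity-1 operators
BINARY = [sp.Add, sp.Mul]           # arity-2 operators

def enumerate_exprs(k):
    """All expressions whose tree uses exactly k nodes."""
    if k == 1:
        return list(NULLARY)
    exprs = []
    # Unary root: one operator node plus a (k-1)-node subtree.
    for op, sub in product(UNARY, enumerate_exprs(k - 1)):
        exprs.append(op(sub))
    # Binary root: split the remaining k-1 nodes between two subtrees.
    for left_k in range(1, k - 1):
        for op, l, r in product(BINARY, enumerate_exprs(left_k),
                                enumerate_exprs(k - 1 - left_k)):
            exprs.append(op(l, r))
    return exprs

def unique_up_to(k_max):
    """Canonicalize and hash to keep one representative per algebraic class."""
    seen, reps = set(), []
    for k in range(1, k_max + 1):
        for e in enumerate_exprs(k):
            key = sp.srepr(sp.simplify(e))  # canonical string as hash key
            if key not in seen:
                seen.add(key)
                reps.append(e)
    return reps

print(len(unique_up_to(4)), "unique expressions at k_max = 4")
```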
2. Minimum Description Length (MDL) Model Selection
The MDL score for a candidate function $f$ with maximum-likelihood parameters $\hat{\boldsymbol{\theta}}$ is formulated in nats as

$$L(D) = -\log\mathcal{L}(\hat{\boldsymbol{\theta}}) + k\log n + \sum_j \log c_j - \frac{p}{2}\log 3 + \sum_{i=1}^{p}\left(\frac{1}{2}\log I_{ii} + \log\bigl|\hat{\theta}_i\bigr|\right),$$

where:
- $-\log\mathcal{L}(\hat{\boldsymbol{\theta}})$: minus log-likelihood for the data under $f$ (maximum-likelihood parameter fit); under Gaussian noise, $-\log\mathcal{L}(\hat{\boldsymbol{\theta}}) = \frac{1}{2}\sum_i \bigl[y_i - f(x_i;\hat{\boldsymbol{\theta}})\bigr]^2/\sigma_i^2 + \text{const}$
- $k$: number of tree nodes; $n$: basis/operator set size ($\log n$ nats per node)
- $\sum_j \log c_j$: cumulative penalty for integer constants $c_j$ arising in normalization/canonicalization
- $p$: number of free continuous parameters
- $I_{ii}$: observed Fisher information matrix diagonal at $\hat{\boldsymbol{\theta}}$ (second derivatives of $-\log\mathcal{L}$)
- $-\frac{p}{2}\log 3$: absorbed constant from parameter precision coding
This information-theoretic trade-off encodes both model complexity (structural and parametric) and data-fit (likelihood), favoring parsimonious, high-accuracy models.
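A minimal numerical sketch of this codelength, assuming Gaussian noise; the helper names (`neg_log_like`, `mdl_score`) and the finite-difference Fisher approximation are illustrative, not part of the ESR package:

```python
import numpy as np

def neg_log_like(theta, x, y, sigma, f):
    """Gaussian negative log-likelihood of model f(x; theta)."""
    r = (y - f(x, theta)) / sigma
    return 0.5 * np.sum(r**2) + np.sum(np.log(sigma)) + 0.5 * x.size * np.log(2 * np.pi)

def mdl_score(theta_hat, x, y, sigma, f, k, n_ops, int_consts=()):
    """Codelength in nats per the formula above; theta_hat is the ML fit."""
    nll = neg_log_like(theta_hat, x, y, sigma, f)
    p = len(theta_hat)
    # Diagonal of the observed Fisher information via central differences.
    I_diag = np.empty(p)
    for i in range(p):
        h = 1e-4 * max(abs(theta_hat[i]), 1.0)
        tp, tm = theta_hat.copy(), theta_hat.copy()
        tp[i] += h
        tm[i] -= h
        I_diag[i] = (neg_log_like(tp, x, y, sigma, f) - 2 * nll
                     + neg_log_like(tm, x, y, sigma, f)) / h**2
    param_term = np.sum(0.5 * np.log(I_diag) + np.log(np.abs(theta_hat)))
    return (nll + k * np.log(n_ops) + sum(np.log(c) for c in int_consts)
            - 0.5 * p * np.log(3) + param_term)

# Example: score the (hypothetical) candidate y = a0 * x on synthetic data.
rng = np.random.default_rng(0)
x = np.linspace(1, 10, 32)
sigma = np.full_like(x, 0.1)
y = 2.5 * x + rng.normal(0, 0.1, x.size)
print(mdl_score(np.array([2.5]), x, y, sigma, lambda x, t: t[0] * x, k=3, n_ops=8))
```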
3. Search and Fitting Algorithmic Workflow
The ESR pipeline follows this deterministic procedure:
- Tree template enumeration: Generate all operator-arity tree shapes up to complexity $k_{\max}$ (structural enumeration).
- Operator assignment: For each template, exhaustively assign basis operators to tree nodes; maintain only valid, executable expression candidates.
- Canonicalization and deduplication: Simplify and hash expressions to collapse all algebraic duplicates; retain only the canonical form.
- Parameter fitting: For each unique expression, perform nonlinear optimization (e.g., BFGS) of real parameters to maximize the data likelihood; employ multi-starts for robustness against local optima (see the sketch after this list).
- Scoring: Compute MDL score for each fitted model using the formula above.
- Ranking: Sort all models by increasing $L(D)$; present the top-ranked model(s) or those within an information threshold $\Delta L$.
Strong pruning heuristics (early canonicalization, partial tree simplification, parallelized fitting) and pre-computation of tree templates are applied to make exhaustive search computationally feasible up to $k_{\max} \approx 10$.
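The multi-start fitting step can be sketched as follows, reusing the hypothetical `neg_log_like` helper from the previous snippet; the restart count and seed distribution are illustrative, not ESR's actual defaults:

```python
import numpy as np
from scipy.optimize import minimize

def fit_params(f, x, y, sigma, p, n_starts=20, seed=0):
    """Maximize the likelihood of f(x; theta) over p parameters by running
    BFGS from several random initial guesses and keeping the best result."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_starts):
        theta0 = rng.normal(0.0, 3.0, size=p)  # dispersed random restart
        res = minimize(neg_log_like, theta0, args=(x, y, sigma, f), method="BFGS")
        if res.success and (best is None or res.fun < best.fun):
            best = res
    return best  # None if every restart failed to converge
```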
4. Complexity and Computational Feasibility
The number of possible expressions before deduplication grows as $\sim C_k\,n^k$, where $C_k$ is the $k$-th Catalan number (counting tree shapes) and $n$ the operator-set size. At $k_{\max} = 10$ this amounts to billions of raw candidates, yet hash-based deduplication reduces the required parameter fits by up to three orders of magnitude (e.g., roughly $5.2$M unique expressions at $k_{\max} = 10$).
With aggressive parallelization, empirical timing data indicate that at $k_{\max} = 10$ enumeration and deduplication complete in minutes, while parameter fitting dominates at the scale of CPU-hours on $196$ cores. Wall time grows exponentially with $k_{\max}$; beyond $k_{\max} \approx 10$, feasibility degrades rapidly.
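A quick back-of-the-envelope of the pre-deduplication growth (the basis size is illustrative, and the exact ESR count depends on the mix of operator arities in the basis):

```python
from math import comb

def catalan(k):
    """Number of distinct binary-tree shapes with k internal nodes."""
    return comb(2 * k, k) // (k + 1)

n = 8  # illustrative operator-basis size
for k in (5, 8, 10):
    print(f"k={k:2d}: ~{catalan(k) * n**k:.3e} labeled trees before dedup")
```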
5. Comparison with Traditional Genetic Programming Symbolic Regression
| Property | ESR | Stochastic GP |
|---|---|---|
| Search completeness | Deterministic, globally complete up to $k_{\max}$ | Probabilistic, may miss optimum |
| Parameter fitting | Full numerical optimization per candidate | Local or post-hoc fitting often used |
| Model ranking | Single scalar MDL score | Pareto front, selection is ad hoc |
| Algebraic equivalence | Canonicalization eliminates duplicates | No guarantee |
| Computational expense | High but parallelizable; bounded by $k_{\max}$ | Scalable but non-deterministic |
ESR guarantees discovery of all possible functions (within the basis and $k_{\max}$) and provides a single, principled model-selection mechanism. Empirically, ESR finds the true optimal formula in toy and physical-law benchmarks where stochastic GP often returns approximate forms or misses sharp performance transitions at threshold complexity.
6. Notable Empirical Results and Applications
Astrophysical applications demonstrate ESR’s impact:
- Cosmic expansion rate $H(z)$: ESR+MDL selects simpler functional forms with significantly better description lengths than the standard $\Lambda$CDM expression $H^2(z) = H_0^2\bigl[\Omega_m(1+z)^3 + \Omega_\Lambda\bigr]$, for both cosmic-chronometer (32 points) and Pantheon+ supernova (1590 points) data. The MDL-optimal forms may suggest overparameterization or non-uniqueness of standard cosmological models at current data quality.
- MOND radial-acceleration relation: ESR recovers hundreds of functions that outperform the classic interpolating functions in MDL; most best-fit models do not reproduce the deep-MOND limit, indicating that the data alone do not uniquely specify the functional form.
- Inflationary potential reconstruction: ESR with a language-model grammar prior (Katz back-off $n$-gram) recovers physically preferred functional forms, with the literature-standard potentials (Starobinsky, quadratic, quartic) ranking far lower in MDL.
Benchmarking on toy problems further demonstrates that ESR rediscovers canonical formulas (e.g., normalized Gaussian) exactly, identifying the minimal complexity required for zero-error representation.
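As an illustrative toy check in this spirit, reusing the hypothetical `mdl_score` sketch from Section 2 (the node counts `k` assigned to each form are assumptions), the parameter-free normalized Gaussian attains a lower codelength than an otherwise identical form carrying a free amplitude:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 200)
sigma = np.full_like(x, 0.01)
y = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi) + rng.normal(0, 0.01, x.size)

# Exact normalized Gaussian: no continuous parameters (p = 0).
gauss_exact = lambda x, t: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
# Same shape with a free amplitude: one continuous parameter (p = 1).
gauss_amp = lambda x, t: t[0] * np.exp(-x**2 / 2)

# Node counts k are illustrative; the parameter-free form avoids the
# continuous-parameter penalty and so scores a lower codelength here.
print(mdl_score(np.array([]), x, y, sigma, gauss_exact, k=9, n_ops=8))
print(mdl_score(np.array([1 / np.sqrt(2 * np.pi)]), x, y, sigma,
                gauss_amp, k=8, n_ops=8))
```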
7. Limitations, Extensions, and Significance
ESR is limited by exponential scaling in complexity and operator-set size. Beyond roughly $k_{\max} \approx 10$, or with very large operator banks, run-time and memory costs become prohibitive. Additionally, ESR explores only the expressivity permitted by the fixed basis and grammar; unrepresented classes of expressions, and highly nested forms exceeding $k_{\max}$, are inaccessible.
Potential extensions involve hierarchical or language-model-based priors for operator selection (as shown in inflationary reconstructions), hybrid ESR-GP approaches (using ESR-generated models as seeds for further stochastic search at higher complexity), and the application of ESR to alternative coding/decoding or noise models. The MDL framework could also be adapted to cross-validation or predictive criteria when measurement uncertainties are unknown.
In summary, ESR provides a deterministic, reproducible, and interpretable pipeline for symbolic regression, with a mathematically grounded model-selection criterion and guaranteed completeness within its search space. Studies demonstrate that ESR can outperform dominant heuristic algorithms in both classical and scientific symbolic discovery, offering a credible alternative for critical domains where model optimality and interpretability are paramount (Desmond 2025; Bartlett et al. 2022).