Feynman Symbolic Regression Database
- The Feynman Symbolic Regression Database is a curated collection of physics-inspired regression tasks with detailed metadata and synthetic datasets.
- It underpins methods that combine neural-symbolic regression, genetic programming, and dimensional analysis to produce physically plausible, interpretable models.
- Benchmark results show high equation-recovery rates and noise robustness on complex equations, establishing FSReD as a reference standard for interpretable scientific discovery.
The Feynman Symbolic Regression Database (FSReD) is a foundational collection of physics-inspired regression tasks designed to systematically assess symbolic regression methodologies. Originating with the AI Feynman project and now forming the core of multiple benchmark suites, FSReD presents hundreds of equations from the Feynman Lectures on Physics—each paired with large, synthetically generated datasets and complete physical dimension annotations. The database has catalyzed advances in both algorithmic frameworks (neural-symbolic, genetic programming, Kolmogorov–Arnold networks) and scientific benchmarking standards for interpretable discovery.
1. Database Composition and Data Structure
FSReD comprises curated regression problems, each defined by a set of input variables, an output generated by an explicit physical formula, and rich metadata. The canonical release consists of 100 “mystery” equations plus a 20-equation bonus set from advanced texts, and recent extensions (SRSD-Feynman and Feynman-AI) expand coverage to 117–120 formulas (Udrescu et al., 2019, Matsubara et al., 2022, Bruneton, 24 Mar 2025).
Each task includes:
- A data table (CSV) of sampled rows, with variables drawn uniformly or log-uniformly from physically meaningful ranges (e.g., velocities, distances, temperatures).
- A units table (CSV) specifying SI exponents for each variable and output (e.g., [m, s, kg, K, V]).
- A ground-truth expression in LaTeX.
- Optional dummy variables (in the SRSD-Feynman variant) to test feature selection.
Dataset examples and physical coverage span classical mechanics, electromagnetism, quantum mechanics, and thermodynamics. Equation complexity ranges from polynomials to deeply nested transcendental forms.
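The on-disk layout described above can be mimicked with a small synthetic example. The formula (Newtonian gravitation) and the column names here are illustrative assumptions, not the database's actual file schema:

```python
import csv
import io
import random

# Hypothetical FSReD-style task: the formula (Newtonian gravitation,
# F = G*m1*m2/r**2) and the column names are illustrative, not the
# database's actual schema.
G = 6.674e-11  # gravitational constant in SI units

random.seed(0)
rows = [
    (m1, m2, r, G * m1 * m2 / r**2)
    for m1, m2, r in (
        (random.uniform(1, 5), random.uniform(1, 5), random.uniform(1, 5))
        for _ in range(1000)
    )
]

# Write and re-read the CSV data table, as an FSReD task would be stored.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["m1", "m2", "r", "F"])
writer.writerows(rows)

buf.seek(0)
table = list(csv.reader(buf))
header, data = table[0], table[1:]
```

A real task additionally carries the units table and the LaTeX ground-truth expression alongside the sampled data.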
2. Generation Protocols and Physical Realism
Data sampling rigorously enforces physical plausibility:
- Variable ranges are chosen per formula: intervals such as [1, 5], spans of several orders of magnitude, or angular domains.
- Constants are fixed at canonical SI values.
- Output values remain noise-free by default, with controlled noise added for robustness experiments (Udrescu et al., 2020, Bruneton, 24 Mar 2025, Matsubara et al., 2022).
- Dimensional analysis is automated using null-space techniques to generate dimensionless combinations and reduce variable counts, facilitating symbolic discovery.
- Preprocessing utilities (as in AI Feynman) include neural network interpolators (for modularity detection), polynomial fitting (degrees 0–4), and brute-force symbolic search.
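The null-space reduction above can be sketched as follows. The variable set (Newtonian gravitation, with G held fixed) and the three-unit basis (kg, m, s) are simplifications for illustration:

```python
import sympy as sp

# Null-space dimensional analysis: each row of M holds one variable's
# exponents in SI base units (columns are kg, m, s). Variables follow
# F = G*m1*m2/r**2 with G treated as a fixed constant (illustrative).
M = sp.Matrix([
    [1, 0, 0],   # m1 -> kg
    [1, 0, 0],   # m2 -> kg
    [0, 1, 0],   # r  -> m
    [1, 1, -2],  # F  -> kg*m/s^2
])

# A power product prod(x_i ** p_i) is dimensionless iff M.T * p = 0,
# so dimensionless groups span the null space of M.T.
groups = M.T.nullspace()
for p in groups:
    print(p.T)  # one group here: m2/m1, exponents (-1, 1, 0, 0)
```

Each basis vector of the null space yields one dimensionless combination, reducing the number of effective variables the symbolic search must handle.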
3. Benchmarking Frameworks and Evaluation Metrics
FSReD underpins leading symbolic regression benchmarks:
- AI Feynman (Udrescu et al., 2019), SRSD-Feynman (Matsubara et al., 2022), and Feynman-AI (Bruneton, 24 Mar 2025) offer tiered difficulty splits (easy/medium/hard), solution rates, and robust test/train/validation partitions.
- Solution validation is performed via algebraic simplification and symbolic canonicalization (e.g., SymPy).
- Recent metrics include the Normalized Edit Distance (NED) (Matsubara et al., 2022): for predicted and ground-truth expression trees, if d is the tree-edit distance between them and n the node count of the true tree, NED = d/n (clipped to at most 1). NED quantifies how far a prediction is from exact recovery and penalizes spurious/dummy features.
| Benchmark | # Equations | Samples per equation | Best solution rate | Dummy-variable tests | NED metric |
|---|---|---|---|---|---|
| FSReD (AIF) | 100+20 | rows | Up to 100% (AIF v1); 53% (AIF v2) | No | No |
| SRSD-Feynman | 120 | rows | 93.3% (Easy tier, KAN-SR) (Bühler et al., 12 Sep 2025) | Yes | Yes |
| Feynman-AI | 117 | rows | 91.6% (QDSR) (Bruneton, 24 Mar 2025) | No | No |
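The validation and NED steps can be sketched in SymPy. The sequence-based edit distance below is a simplified stand-in for the full tree-edit distance of Matsubara et al., used here only to illustrate the normalization:

```python
import sympy as sp
from difflib import SequenceMatcher

def preorder(expr):
    """Flatten a SymPy expression tree into a preorder label list."""
    label = expr.func.__name__ if expr.args else str(expr)
    out = [label]
    for arg in expr.args:
        out.extend(preorder(arg))
    return out

def ned(pred, true):
    """Approximate NED: a sequence edit distance over preorder
    traversals stands in for the tree-edit distance; the result is
    normalized by the true tree's node count and capped at 1."""
    p, t = preorder(sp.simplify(pred)), preorder(sp.simplify(true))
    matches = sum(b.size for b in SequenceMatcher(a=p, b=t).get_matching_blocks())
    dist = max(len(p), len(t)) - matches
    return min(1.0, dist / len(t))

x, y = sp.symbols("x y")
assert sp.simplify(x * y - y * x) == 0   # algebraic equivalence check
print(ned(x * y, y * x))                 # canonically identical -> 0.0
print(ned(x + y, x * y))                 # structurally different -> > 0
```

Canonicalization via `sympy.simplify` handles the algebraic-equivalence check; the normalized distance then grades near-misses instead of scoring them as total failures.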
4. Symbolic Regression Methods and Algorithmic Innovations
FSReD has driven methodological innovation:
- Neural-symbolic regression (AI Feynman): Exploits feed-forward networks for modularity and compositionality detection, recursively decomposing problems and searching for Pareto-optimal formulas in complexity-vs.-inaccuracy space. Gradient-based tests identify symmetries, separabilities, and compositional structure; for example, dependence only on a difference x_i − x_j is flagged when ∂f/∂x_i + ∂f/∂x_j ≈ 0 across samples, and additive separability f = g(x_a) + h(x_b) is flagged when the mixed partials ∂²f/∂x_a∂x_b ≈ 0.
- Genetic programming and Quality-Diversity (QDSR): Tree-based GP, QD grid (MAP-Elites) to maximize solution diversity and coverage across complexity bins. Dimensional analysis (DA) constraints guarantee physical admissibility. Explicit vocabulary expansion (dimensionless ratios, physics-inspired primitives) substantially improves recovery rates by simplifying target expressions (Bruneton, 24 Mar 2025).
- Kolmogorov–Arnold Networks and divide-and-conquer symbolic extraction (KAN-SR): KANs provide expressive, sparse functional bases; recursive symbolic simplification leverages AI Feynman's symmetry and separability tests; symbolic fits map learned radial basis expansions to closed-form operators (Bühler et al., 12 Sep 2025).
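The additive-separability test mentioned above can also be run purely numerically. The pairwise cancellation identity below is one standard formulation; the sampling range and tolerance are illustrative choices:

```python
import random

def additively_separable(f, n_trials=200, tol=1e-8):
    """Numerical separability check in the spirit of AI Feynman:
    if f(a, b) = g(a) + h(b), then for any two sample points
    f(a, b) + f(a2, b2) - f(a, b2) - f(a2, b) = 0 exactly;
    a tolerance absorbs floating-point error."""
    rng = random.Random(0)
    for _ in range(n_trials):
        a, a2, b, b2 = (rng.uniform(1, 5) for _ in range(4))
        if abs(f(a, b) + f(a2, b2) - f(a, b2) - f(a2, b)) > tol:
            return False
    return True

print(additively_separable(lambda a, b: a**2 + 3 * b))  # True
print(additively_separable(lambda a, b: a * b))         # False
```

When the test passes, the problem splits into two independent sub-problems over x_a and x_b, which is the recursion step both AI Feynman and KAN-SR exploit.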
5. Performance, Robustness, and Comparative Results
Experiments on FSReD variants demonstrate substantial robustness and accuracy:
- On the original Feynman-AI benchmark (117 equations, up to six variables), QDSR achieves 91.6% exact recovery under noiseless conditions and 79.4% under 10% Gaussian noise, surpassing all prior SR algorithms by over 20 percentage points (Bruneton, 24 Mar 2025).
- KAN-SR attains a 93.3% solution rate on Easy-tier SRSD-Feynman problems, with significant resilience to dummy features and output noise, maintaining over 50% success under added noise (Bühler et al., 12 Sep 2025).
- AI Feynman 2.0 demonstrates markedly improved noise robustness, solving 73 of 100 equations with Gaussian noise added to the outputs, at noise levels orders of magnitude above the previous state of the art (Udrescu et al., 2020).
- SRSD benchmarks (with dummy variables and realistic physical ranges) reveal degradation for all methods, confirming the need for robust feature-selection and interpretable metrics like NED (Matsubara et al., 2022).
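The noise protocol used in these robustness experiments can be sketched as follows. Treating the noise level as a fraction of each target's magnitude is one common convention; benchmarks differ on whether noise is relative or absolute:

```python
import random

def add_noise(y, noise_level=0.10, seed=0):
    """Perturb targets with zero-mean Gaussian noise whose standard
    deviation is noise_level times each target's magnitude (one
    convention among several; the 10% default matches the Feynman-AI
    experiments cited above)."""
    rng = random.Random(seed)
    return [yi + rng.gauss(0.0, noise_level * abs(yi)) for yi in y]

clean = [2.0, -5.0, 10.0]
noisy = add_noise(clean)
```

Sweeping `noise_level` and re-running a solver on the perturbed targets yields the degradation curves reported by the benchmarks.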
6. Extensions: Analytic Integral Regression and Automation
FSReD concepts extend to analytic regression for Feynman integrals:
- Symbolic regression from high-precision numerical data is possible by constructing linear ansätze over known function spaces (polylogarithms, elliptic generalizations) and fitting rational coefficients via lattice reduction (LLL, BKZ) (Barrera et al., 23 Jul 2025).
- Automation pipelines are proposed: input topology → construction of an analytic function basis (symbol alphabet, Landau singularities) → arbitrary-precision sampling → LLL-based coefficient fitting → formula storage and indexing.
- Full database infrastructure includes schema for integral topology, kinematic ranges, function alphabets, expansion parameters, sampled data, recovered expressions, and cross-validation against known analytic results.
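The coefficient-fitting step can be sketched with an integer-relation search. mpmath's PSLQ stands in here for the LLL/BKZ lattice reduction named above, and the target value is a toy surrogate for a Feynman-integral evaluation:

```python
from mpmath import mp, mpf, log, pi, pslq

# Given a high-precision value V believed to lie in the rational span
# of a known function basis, an integer-relation algorithm recovers the
# coefficients. PSLQ here is a stand-in for the LLL/BKZ reduction used
# in the cited work; the "integral value" below is a toy example.
mp.dps = 50  # 50 decimal digits of working precision

# Pretend V = 3*log(2) - 2*pi + 7 arrived as an unknown numerical value.
V = 3 * log(2) - 2 * pi + mpf(7)

# Basis spanning the assumed function space, with the value prepended.
basis = [V, log(2), pi, mpf(1)]
relation = pslq(basis)
print(relation)  # integer c with c[0]*V + c[1]*log2 + c[2]*pi + c[3] = 0
```

Solving the recovered relation for V reproduces the closed form; in the full pipeline the basis would be polylogarithms or their elliptic generalizations evaluated at the same kinematic point.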
7. Research Impact, Limitations, and Future Directions
The Feynman Symbolic Regression Database defines the state-of-the-art for physics-driven interpretable regression:
- It enforces rigorous physical realism in data generation, promoting recovery of meaningful and succinct laws.
- The integration of modularity detection, Pareto-optimal search, and domain knowledge catalyzes rich algorithmic development.
- Limitations remain in scalability to high-dimensionality, recursive depth, and the choice of basis functions—where certain primitives or algebraic forms may hinder convergence.
- Open problems include automated module discovery (multi-output/vectors), expansion of primitive sets (e.g., special functions), latent-variable symbolic regression, and benchmarking on complex scientific datasets (CFD, molecular simulation).
- Analytic regression and integrals can be robustly automated, enabling bottom-up database construction and complementing analytic structure-based approaches.
FSReD continues as an indispensable resource for symbolic machine learning, physics-inspired regression, and scientific discovery, with evolving extensions addressing density estimation, integral regression, and broader benchmarking standards.