- The paper introduces a novel Pareto-optimal method that balances formula complexity with accuracy for symbolic regression.
- It leverages generalized symmetries and graph modularity to enhance formula discovery and robustness against noisy data.
- It uses improved statistical hypothesis testing to streamline the brute-force search and reduce reliance on arbitrary hyperparameters, and normalizing flows to extend symbolic regression to probability distributions known only through samples.
AI Feynman 2.0: Advancements in Symbolic Regression
The paper "AI Feynman 2.0: Pareto-optimal symbolic regression exploiting graph modularity" presents an enhanced methodology for symbolic regression, an important computational task in extracting symbolic expressions that provide accurate yet simplified representations of data. The proposed method improves upon existing models by significantly increasing robustness against noise and erroneous data, thereby extending the capability of symbolic regression to discover formulas previously elusive.
The central innovation of this research is the exploitation of generalized symmetries and graph modularity in the computational graph of the learned formulas. By examining the gradients of neural network fits to the data, the method identifies and exploits graph modularity in a general way, unlike prior work that only tested for a few specific bivariate forms. This generalization allows the method to discover arbitrary compositional structure, including modules involving more than two variables, greatly expanding its ability to identify the underlying structure of the data.
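As a rough illustration of the gradient idea, the following sketch (a minimal example of the general technique, not the authors' implementation; the network architecture, training loop, and `symmetry_score` statistic are assumptions) fits a neural network to toy data and tests whether its gradients reveal that two inputs only enter through their sum, i.e. that df/dx1 ≈ df/dx2 everywhere:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy data whose hidden formula has a generalized (translational) symmetry:
# y depends on x1 and x2 only through the combination x1 + x2.
X = torch.rand(2000, 3) * 2 - 1
y = ((X[:, 0] + X[:, 1]) ** 2 + torch.sin(3 * X[:, 2])).unsqueeze(1)

# Fit a small neural network as a differentiable surrogate of the data.
net = nn.Sequential(nn.Linear(3, 64), nn.Tanh(),
                    nn.Linear(64, 64), nn.Tanh(),
                    nn.Linear(64, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-2)
for _ in range(2000):
    opt.zero_grad()
    loss = nn.functional.mse_loss(net(X), y)
    loss.backward()
    opt.step()

# Gradient test: if f depends on (xi, xj) only via xi + xj, then
# df/dxi == df/dxj at every point, so their relative mismatch is ~0.
Xg = X.clone().requires_grad_(True)
grad, = torch.autograd.grad(net(Xg).sum(), Xg)

def symmetry_score(i, j):
    """Median relative mismatch between df/dxi and df/dxj (small => symmetric)."""
    diff = (grad[:, i] - grad[:, j]).abs()
    scale = grad[:, i].abs() + grad[:, j].abs() + 1e-8
    return (diff / scale).median().item()

print("pair (x1, x2):", symmetry_score(0, 1))  # expected small: symmetric pair
print("pair (x1, x3):", symmetry_score(0, 2))  # expected much larger: no symmetry
```

Once such a module is detected, the problem can be split into smaller regression problems over the module and the remaining variables, which is the sense in which graph modularity simplifies the search.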
A significant contribution of this paper is the use of Pareto-optimality in selecting formulas, balancing the complexity of an expression against its accuracy. This enhances the robustness of the fitting process, allowing the algorithm to maintain high performance even on noisy datasets. Moreover, the use of normalizing flows extends symbolic regression to probability distributions when only sample data is available, broadening the applicability of symbolic regression in data science.
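The selection logic itself is simple to sketch. In the example below (hypothetical complexity and error numbers; the paper measures complexity as a description length in bits, which is not reproduced here), a candidate formula is kept only if no other candidate is both at least as simple and at least as accurate:

```python
def pareto_front(candidates):
    """Return the candidates not dominated in both complexity and error.

    `candidates` is a list of (name, complexity, error) tuples; a candidate is
    dropped if some other candidate is at least as simple AND at least as
    accurate, with one of the two strictly better.
    """
    front = []
    for name, c, e in candidates:
        dominated = any(
            (c2 <= c and e2 <= e) and (c2 < c or e2 < e)
            for _, c2, e2 in candidates
        )
        if not dominated:
            front.append((name, c, e))
    return sorted(front, key=lambda t: t[1])

# Hypothetical (complexity, mean error) scores for candidate formulas.
candidates = [
    ("x1 + x2",       8, 0.90),
    ("x1 * x2",      10, 0.40),
    ("x1*x2 + x2",   16, 0.45),   # dominated: more complex and less accurate than x1*x2
    ("x1 * sin(x2)", 14, 0.02),
    ("poly deg 9",   60, 0.015),
]
print(pareto_front(candidates))
# Keeps the simple-but-rough and complex-but-accurate formulas,
# drops anything beaten on both axes.
```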
The improved use of statistical hypothesis testing to reject unpromising candidate formulas streamlines the brute-force search, adding both efficiency and robustness. This statistical approach reduces reliance on arbitrary hyperparameters, mitigating potential biases and keeping the fitting procedure principled.
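The paper's specific test is not reproduced here, but the following sketch conveys the idea with a standard sign test on paired errors (the error arrays and the `alpha` threshold are illustrative assumptions): a candidate formula is discarded early if, on a small batch of points, it is significantly worse than the best formula found so far, so no hand-tuned error cutoff is needed.

```python
import math
import numpy as np

def sign_test_pvalue(wins, n):
    """One-sided binomial p-value for observing <= `wins` successes out of `n`
    under the null hypothesis that the candidate is at least as good (p = 0.5)."""
    return sum(math.comb(n, k) for k in range(wins + 1)) * 0.5 ** n

def reject_candidate(err_candidate, err_best, alpha=0.01):
    """Reject a candidate formula if it is significantly worse, point by point,
    than the best formula found so far (a sign test on paired errors)."""
    wins = int(np.sum(err_candidate < err_best))  # points where the candidate is better
    return sign_test_pvalue(wins, len(err_candidate)) < alpha

# Hypothetical per-point absolute errors on a small random batch of data.
rng = np.random.default_rng(1)
err_best = rng.exponential(0.01, 100)   # errors of the current best formula
err_bad  = rng.exponential(0.50, 100)   # a clearly worse candidate
err_tie  = rng.exponential(0.01, 100)   # a comparable candidate

print(reject_candidate(err_bad, err_best))   # True: safely discard early
print(reject_candidate(err_tie, err_best))   # False: keep it in the search
```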
The implications of these advancements are significant. Theoretically, this enhanced approach to symbolic regression provides insights into the structure of scientific formulas, potentially replacing some traditional models with simpler, interpretable approximations. Practically, the method's robustness to noise and its capability to handle complex distributional data open avenues for its application in real-world data analysis scenarios, where interpretability and transparency are crucial.
Looking forward, further exploration into the broader application of graph modularity could lead to important breakthroughs in pattern recognition and automated model discovery. Moreover, the integration of these techniques into larger AI systems might enhance capabilities in symbolic reasoning tasks, effectively bridging the gap between interpretable models and performance-driven AI technologies.
Overall, AI Feynman 2.0 represents a significant step forward in symbolic regression, providing a more potent tool for identifying and understanding the symbolic relationships inherent in complex datasets. As the field continues to evolve, it offers the promise of integrating more sophisticated AI into scientific discovery processes, potentially leading to new insights previously not possible with existing tools.