Vertical Symbolic Regression (2312.11955v1)
Abstract: Automating scientific discovery has been a grand goal of AI and will bring tremendous societal impact. Learning symbolic expressions from experimental data is a vital step in AI-driven scientific discovery. Despite exciting progress, most endeavors have focused on horizontal discovery paths, i.e., they directly search for the best expression in the full hypothesis space involving all the independent variables. Horizontal paths are challenging because this hypothesis space is exponentially large. We propose Vertical Symbolic Regression (VSR) to expedite symbolic regression. VSR starts by fitting simple expressions involving a few independent variables under controlled experiments in which the remaining variables are held constant. It then extends the expressions learned in previous rounds by adding new independent variables and using new control variable experiments in which these variables are allowed to vary. The first few steps of vertical discovery are significantly cheaper than the horizontal path, because they search reduced hypothesis spaces involving only a small set of variables. As a consequence, vertical discovery has the potential to supercharge state-of-the-art symbolic regression approaches in handling complex equations with many contributing factors. Theoretically, we show that the search space of VSR can be exponentially smaller than that of horizontal approaches when learning a class of expressions. Experimentally, VSR outperforms several baselines in learning symbolic expressions involving many independent variables.
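The vertical procedure described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the ground truth `y = x1 * x2 + 3`, the function names (`oracle`, `controlled_trial`, `vertical_fit`), and the use of linear least squares in place of a real symbolic regressor are all assumptions made for illustration. Round one holds `x2` constant and fits an expression in `x1` only; round two compares the fitted coefficients across trials and extends any coefficient that changes with `x2`.

```python
import numpy as np


def oracle(x1, x2):
    # Hypothetical ground-truth law, hidden from the learner (assumption).
    return x1 * x2 + 3.0


def controlled_trial(x2_fixed, rng, n=20):
    # Controlled experiment: x2 is held constant, only x1 varies.
    x1 = rng.uniform(-2.0, 2.0, n)
    y = oracle(x1, x2_fixed)
    # Reduced hypothesis space: y = slope * x1 + intercept.
    slope, intercept = np.polyfit(x1, y, 1)
    return slope, intercept


def vertical_fit(seed=0, n_trials=8, tol=1e-6):
    rng = np.random.default_rng(seed)
    # Round 1: repeat the controlled experiment at several fixed x2 values.
    x2_values = rng.uniform(-2.0, 2.0, n_trials)
    fits = np.array([controlled_trial(v, rng) for v in x2_values])
    result = {}
    for name, values in (("slope", fits[:, 0]), ("intercept", fits[:, 1])):
        if np.std(values) < tol:
            # Coefficient is the same in every trial: a true constant.
            result[name] = ("constant", float(np.mean(values)))
        else:
            # Round 2: the coefficient depends on the newly freed variable,
            # so extend it to an expression in x2 (here: linear in x2).
            p, q = np.polyfit(x2_values, values, 1)
            result[name] = ("linear_in_x2", float(p), float(q))
    return result


if __name__ == "__main__":
    print(vertical_fit())
```

In this sketch, each round searches only over expressions in a single variable, which is the source of the cost reduction the abstract attributes to the early steps of vertical discovery.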