Interpretable Symbolic Regression for Data Science: Analysis of the 2022 Competition (2304.01117v3)
Abstract: Symbolic regression searches for analytic expressions that accurately describe studied phenomena. The main attraction of this approach is that it returns an interpretable model that can be insightful to users. Historically, the majority of algorithms for symbolic regression have been based on evolutionary computation. However, there has been a recent surge of new proposals that instead utilize approaches such as enumeration algorithms, mixed-integer linear programming, neural networks, and Bayesian optimization. To assess how well these new approaches behave on a set of common challenges often faced in real-world data, we hosted a competition at the 2022 Genetic and Evolutionary Computation Conference consisting of different synthetic and real-world datasets that were blind to entrants. For the real-world track, we assessed interpretability in a realistic way by having a domain expert judge the trustworthiness of candidate models. We present an in-depth analysis of the results obtained in this competition, discuss current challenges of symbolic regression algorithms, and highlight possible improvements for future competitions.
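To make the opening idea concrete, the following is a minimal, hypothetical sketch of what "searching for an analytic expression" means: a naive random search over small expression trees built from `{x, constants, +, -, *}`, scored by mean squared error against samples of an unknown target function (here `y = x^2 + x`). Real systems such as the evolutionary, enumeration-based, or neural methods discussed in the paper are far more sophisticated; this only illustrates the problem setup.

```python
import random

# Samples of the "unknown" phenomenon the search must explain: y = x^2 + x.
xs = [i / 10 for i in range(-20, 21)]
ys = [x * x + x for x in xs]

# A tiny function set; real symbolic regression systems use a richer one.
OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
}

def random_expr(depth=0):
    """Grow a random expression tree: leaves are x or a constant."""
    if depth >= 2 or random.random() < 0.3:
        return "x" if random.random() < 0.7 else random.choice([1.0, 2.0])
    op = random.choice(list(OPS))
    return (op, random_expr(depth + 1), random_expr(depth + 1))

def evaluate(expr, x):
    """Recursively evaluate an expression tree at a point x."""
    if expr == "x":
        return x
    if isinstance(expr, float):
        return expr
    op, left, right = expr
    return OPS[op](evaluate(left, x), evaluate(right, x))

def mse(expr):
    """Fitness: mean squared error of the candidate on the data."""
    return sum((evaluate(expr, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

random.seed(0)
best = min((random_expr() for _ in range(5000)), key=mse)
print("best expression:", best, "MSE:", mse(best))
```

The winning tree is itself the model, e.g. `("add", ("mul", "x", "x"), "x")`, i.e. `x*x + x`: unlike a black-box regressor, the output is an expression a domain expert can read and judge for trustworthiness, which is the interpretability property the competition's real-world track evaluates.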