Symbolic Regression with a Learned Concept Library (2409.09359v3)

Published 14 Sep 2024 in cs.LG, cs.AI, cs.NE, and cs.SC

Abstract: We present a novel method for symbolic regression (SR), the task of searching for compact programmatic hypotheses that best explain a dataset. The problem is commonly solved using genetic algorithms; we show that we can enhance such methods by inducing a library of abstract textual concepts. Our algorithm, called LaSR, uses zero-shot queries to an LLM to discover and evolve concepts occurring in known high-performing hypotheses. We discover new hypotheses using a mix of standard evolutionary steps and LLM-guided steps (obtained through zero-shot LLM queries) conditioned on discovered concepts. Once discovered, hypotheses are used in a new round of concept abstraction and evolution. We validate LaSR on the Feynman equations, a popular SR benchmark, as well as a set of synthetic tasks. On these benchmarks, LaSR substantially outperforms a variety of state-of-the-art SR approaches based on deep learning and evolutionary algorithms. Moreover, we show that LaSR can be used to discover a novel and powerful scaling law for LLMs.

Summary

  • The paper introduces LaSR, which integrates LLM-guided operations with genetic algorithms to enhance symbolic regression.
  • It employs a two-stage strategy that alternates between evolving hypotheses and abstracting natural-language concepts.
  • Experimental results show LaSR achieves a 72/100 exact match rate on Feynman equations, outperforming other state-of-the-art methods.

Symbolic Regression with a Learned Concept Library

The paper "Symbolic Regression with a Learned Concept Library" by Arya Grayeli et al. presents a novel approach to symbolic regression (SR) that leverages LLMs to enhance traditional genetic algorithms through the use of natural-language concepts.

Overview

Symbolic regression (SR) is the task of finding a mathematical expression that best fits a given dataset. Traditional approaches to SR rely heavily on genetic algorithms, which explore the space of candidate expressions through mutation and crossover operations. The core contribution of this paper is LaSR, a method that combines evolutionary search with LLMs, guiding the search through a learned library of natural-language concepts.

Methodology

LaSR enhances traditional SR by inducing a library of abstract textual concepts using zero-shot queries to LLMs. The process involves the following key steps:

  1. Hypothesis Evolution: Existing hypotheses are evolved using standard evolutionary steps mixed with LLM-guided operations. A hyperparameter p controls the proportion of LLM-based operations (see the sketch after this list).
  2. Concept Abstraction and Evolution: The best-performing hypotheses are summarized into natural language concepts, which are then added to a concept library. This library is evolved iteratively, ensuring that the search process benefits from progressively refined concepts.
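
The paper does not publish this exact interface, so the following is a minimal sketch of the p-controlled mixing step; llm_mutate and genetic_mutate are illustrative stand-ins, with the actual LLM call and symbolic mutation stubbed out:

```python
import random

def llm_mutate(hypothesis: str, concepts: list[str]) -> str:
    # Stand-in for LaSR's zero-shot LLM query: a real implementation would
    # prompt the model with the concept library and parse a revised expression.
    return hypothesis

def genetic_mutate(hypothesis: str) -> str:
    # Stand-in for a standard symbolic mutation, e.g. swapping an operator
    # or subtree in the expression tree.
    return hypothesis

def mutate(hypothesis: str, concepts: list[str], p: float = 0.1) -> str:
    """With probability p, apply an LLM-guided mutation conditioned on the
    concept library; otherwise apply an ordinary genetic mutation."""
    if random.random() < p:
        return llm_mutate(hypothesis, concepts)
    return genetic_mutate(hypothesis)
```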

The algorithm employs two main stages, following an alternating maximization strategy (sketched in the loop after this list):

  • Stage 1: Fix the set of concepts and maximize the hypotheses' fitness to the dataset.
  • Stage 2: Use the best hypotheses to induce a new library of concepts.
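
A minimal sketch of this outer loop, under assumed helper names (fitness, evolve_hypotheses, abstract_concepts) that stand in for LaSR's actual components:

```python
import random

def fitness(hypothesis, dataset):
    # Stand-in for dataset fit, e.g. negative mean-squared error of the
    # candidate expression evaluated on the data.
    return random.random()

def evolve_hypotheses(population, dataset, concepts):
    # Stage 1 stand-in: run evolutionary (and LLM-guided) steps with the
    # concept library held fixed.
    return population

def abstract_concepts(best_hypotheses, concepts):
    # Stage 2 stand-in: zero-shot LLM summarization of traits shared by the
    # best hypotheses, appended to the evolving concept library.
    return concepts + [f"shared trait of {len(best_hypotheses)} hypotheses"]

def lasr_search(population, dataset, n_iterations=40, k=5):
    concepts = []
    for _ in range(n_iterations):
        # Stage 1: concepts fixed; maximize hypotheses' fitness to the data.
        population = evolve_hypotheses(population, dataset, concepts)
        # Stage 2: induce new concepts from the best hypotheses.
        best = sorted(population, key=lambda h: fitness(h, dataset),
                      reverse=True)[:k]
        concepts = abstract_concepts(best, concepts)
    return max(population, key=lambda h: fitness(h, dataset))
```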

LaSR builds on PySR, a scalable genetic search algorithm for SR. The modifications introduce LLM-augmented initialization, mutation, and crossover operations, thereby incorporating natural-language priors into the search process.
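
For context, the snippet below shows the vanilla PySR backbone that LaSR extends, using PySR's standard scikit-learn-style interface; it does not include LaSR's LLM-augmented operators:

```python
import numpy as np
from pysr import PySRRegressor

# Toy regression problem: recover y = x0^2 + cos(x1) from samples.
X = np.random.randn(200, 2)
y = X[:, 0] ** 2 + np.cos(X[:, 1])

model = PySRRegressor(
    niterations=40,                          # generations of genetic search
    binary_operators=["+", "-", "*", "/"],
    unary_operators=["cos", "exp"],
)
model.fit(X, y)
print(model.sympy())  # best discovered expression in symbolic form
```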

Experiments and Results

The authors validate LaSR on the Feynman equations, a standard benchmark for SR, and a set of synthetic tasks. The empirical results demonstrate that LaSR substantially outperforms state-of-the-art SR approaches based on deep learning and evolutionary algorithms. Key findings include:

  • Feynman Equations Dataset: LaSR outperformed other methods with an exact match solve rate of 72/100. This represents a substantial improvement over PySR and other baseline methods.
  • Synthetic Dataset: LaSR also performed well on synthetic equations designed to control for LLM training-data leakage, indicating that the gains reflect genuine generalization rather than memorization.

Moreover, the authors used LaSR to explore novel empirical relationships, directing it at discovering an LLM scaling law on the BigBench dataset. The results were promising, indicating LaSR's potential for generating novel scientific insights.

Implications and Future Directions

LaSR's ability to generate and evolve natural-language concepts for guiding symbolic regression has significant implications:

  • Enhanced SR Capabilities: The integration of LLMs enables the discovery of more accurate and interpretable equations, enhancing the practical utility of SR in scientific discovery.
  • Scalability and Generalization: The approach scales with its underlying LLM, so it stands to benefit directly as LLMs become more capable and accessible.
  • Cross-domain Applications: The methodology can potentially be adapted to other domains beyond SR, including program synthesis and other areas where evolving patterns and relationships are critical.

Conclusion

The integration of natural-language-driven guidance via LLMs into symbolic regression represents a significant stride towards more effective and interpretable SR solutions. Although this paper focuses primarily on SR, the principles and techniques introduced have broader applicability and promise for various computational tasks where discovering and evolving patterns are essential. Future research may explore fine-tuning the LLMs, expanding the application scope, and optimizing compute efficiency further.

Overall, LaSR stands as a robust method that not only achieves superior performance on established benchmarks but also unlocks new potential for automated empirical discovery in various scientific and engineering domains.
