
LLM-ERM: Sample-Efficient Program Learning via LLM-Guided Search (2510.14331v1)

Published 16 Oct 2025 in cs.LG

Abstract: We seek algorithms for program learning that are both sample-efficient and computationally feasible. Classical results show that targets admitting short program descriptions (e.g., with "short python code") can be learned with a "small" number of examples (scaling with the size of the code) via length-first program enumeration, but the search is exponential in description length. In contrast, gradient-based training avoids this cost yet can require exponentially many samples on certain short-program families. To address this gap, we introduce LLM-ERM, a propose-and-verify framework that replaces exhaustive enumeration with an LLM-guided search over candidate programs while retaining ERM-style selection on held-out data. Specifically, we draw $k$ candidates with a pretrained reasoning-augmented LLM, compile and check each on the data, and return the best verified hypothesis, with no feedback, adaptivity, or gradients. Theoretically, we show that coordinate-wise online mini-batch SGD requires many samples to learn certain short programs. Empirically, LLM-ERM solves tasks such as parity variants, pattern matching, and primality testing with as few as 200 samples, while SGD-trained transformers overfit even with 100,000 samples. These results indicate that language-guided program synthesis recovers much of the statistical efficiency of finite-class ERM while remaining computationally tractable, offering a practical route to learning succinct hypotheses beyond the reach of gradient-based training.

Summary

  • The paper introduces LLM-ERM, which leverages LLM-guided search to generate candidate programs with improved sample efficiency.
  • The experimental results show that LLM-ERM recovers target functions from only 200 samples, outperforming traditional SGD methods.
  • The framework delivers interpretable, executable outputs with reasoning traces that enhance transparency and candidate verification.

This paper presents LLM-ERM, a propose-and-verify framework for program learning in which a pretrained reasoning LLM proposes candidate programs that are then compiled and verified on data. Compared with exhaustive enumeration it is computationally tractable, and compared with gradient-based training it is far more sample-efficient.

Introduction

Traditional program learning by enumerating candidate programs and performing empirical risk minimization (ERM) is statistically efficient but computationally infeasible, since the search is exponential in description length. Gradient-based training, such as stochastic gradient descent (SGD), avoids this cost but can require a large number of samples and tends to overfit when data is scarce.

The authors introduce LLM-ERM, a framework that employs a reasoning-augmented LLM to propose candidate programs and applies ERM-style selection on held-out data, allowing it to generalize well from a modest number of samples (Figure 1).

Figure 1: LLM-ERM generalizes from 200 samples, while an SGD-trained LLM overfits.

Theoretical Framework

The theoretical analysis demonstrates SGD's inherent inefficiency on certain families of short programs: the authors prove that coordinate-wise online mini-batch SGD requires exponentially many samples to learn them, whereas LLM-ERM retains the statistical efficiency of finite-class ERM. The framework's success rests on the LLM's informed proposal of candidates, which directs the search toward promising hypotheses.
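
For background, the statistical efficiency that LLM-ERM aims to recover is the classical guarantee for ERM over a finite hypothesis class. The bound below is the standard realizable-case statement, included here for context rather than quoted from the paper: if the target lies in a finite class $\mathcal{H}$ and the learner sees $m$ i.i.d. samples, then with probability at least $1-\delta$ every hypothesis consistent with the data has error at most $\epsilon$ once

$$ m \;\ge\; \frac{1}{\epsilon}\left(\ln|\mathcal{H}| + \ln\frac{1}{\delta}\right). $$

For programs of description length at most $s$ bits, $|\mathcal{H}| < 2^{s+1}$, so the sample cost scales linearly with code length, $m = O\big((s + \ln(1/\delta))/\epsilon\big)$. The difficulty is that enumerating $\mathcal{H}$ takes time exponential in $s$, which is exactly the gap LLM-ERM targets.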

Methodology

LLM-ERM constructs a prompt from the training data, which guides the LLM in generating a pool of candidate programs. Each candidate is compiled and evaluated, and the one with the best validation performance is selected. The search is non-adaptive, with no feedback or gradients, yet it sidesteps the cost of exhaustive enumeration. A minimal sketch of this loop appears after Figure 2.

Figure 2: LLM-ERM generalizes from 200 samples, while SGD-trained LLM overfits. With only 200 training examples per task, LLM-ERM typically recovers the target function exactly, whereas SGD training fits the training data but fails to generalize on most tasks.
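
The following is a minimal sketch of the propose-and-verify loop described above. It assumes a hypothetical propose_programs wrapper around the pretrained reasoning LLM and assumes each candidate defines a predict(x) entry point; the prompt format and function names are illustrative, not the paper's actual interface.

def llm_erm(train_set, holdout_set, propose_programs, k=32):
    """Draw k candidate programs from the LLM and keep the one with the
    lowest held-out error; no feedback, adaptivity, or gradients."""
    prompt = format_examples(train_set)           # training pairs rendered as text
    candidates = propose_programs(prompt, k=k)    # k candidate program strings
    best_fn, best_err = None, float("inf")
    for source in candidates:
        fn = compile_candidate(source)            # skip programs that fail to compile
        if fn is None:
            continue
        err = sum(fn(x) != y for x, y in holdout_set) / len(holdout_set)
        if err < best_err:
            best_fn, best_err = fn, err
    return best_fn, best_err

def format_examples(pairs):
    # One "input -> label" line per example; a real prompt would add task instructions.
    return "\n".join(f"{x} -> {y}" for x, y in pairs)

def compile_candidate(source):
    # Execute the candidate source and return its entry point, or None if it breaks.
    namespace = {}
    try:
        exec(source, namespace)                   # sandboxed in any real deployment
        return namespace.get("predict")
    except Exception:
        return None

Selecting on a held-out split rather than on the prompt's own examples is what gives the procedure its ERM-style guarantee over the finite pool of k candidates.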

Empirical Evaluation

Experiments cover synthetic algorithmic tasks such as parity variants, pattern matching, and primality testing, chosen to probe generalization. LLM-ERM substantially outperforms the baselines, achieving near-perfect generalization across tasks from a fraction of the data.
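
To make the evaluation protocol concrete, the sketch below constructs a sparse-parity dataset of the kind these task families are built from and scores a candidate on held-out examples. The specific bit-length, hidden index set, and 200/200 split are illustrative assumptions, not the paper's exact task specification.

import random

def make_parity_dataset(n_bits=30, secret=(2, 7, 11, 19), n_samples=400, seed=0):
    # Label is the XOR (parity) of the bits at the hidden index set `secret`.
    rng = random.Random(seed)
    data = []
    for _ in range(n_samples):
        x = [rng.randint(0, 1) for _ in range(n_bits)]
        y = sum(x[i] for i in secret) % 2
        data.append((x, y))
    return data[:200], data[200:]                 # 200 train / 200 held-out

def holdout_accuracy(candidate, holdout):
    return sum(candidate(x) == y for x, y in holdout) / len(holdout)

train, holdout = make_parity_dataset()
# A candidate that recovers the hidden index set generalizes perfectly,
# whereas a model that merely fits the 200 training strings need not.
recovered = lambda x: (x[2] + x[7] + x[11] + x[19]) % 2
print(holdout_accuracy(recovered, holdout))       # 1.0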

The paper also shows that neither fine-tuning pre-trained models nor in-context learning with existing LLMs matches LLM-ERM's efficiency and accuracy (Figure 3).

Figure 3: Fine-tuning pre-trained LLMs fails to overcome overfitting on algorithmic tasks.

Interpretability

The outputs of LLM-ERM are interpretable by design, consisting of human-readable, executable code accompanied by a reasoning trace. This trace offers insights into why certain candidates were chosen, allowing for inspection and modification, thus ensuring transparency in the learning process.
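
As an illustration of the kind of output described here, a returned hypothesis is ordinary source code that can be read, run, and edited. The candidate below, for the primality-testing task, is a hypothetical example of a succinct hypothesis in this style, not output reproduced from the paper.

def predict(n):
    # Hypothetical candidate for the primality task: trial division up to sqrt(n).
    # Short and human-readable, so the decision rule can be inspected or modified.
    if n < 2:
        return 0
    d = 2
    while d * d <= n:
        if n % d == 0:
            return 0
        d += 1
    return 1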

Conclusion

LLM-ERM's propose-and-verify approach shows that the reasoning ability of LLMs can deliver sample efficiency that exhaustive enumeration achieves only at prohibitive computational cost. It offers a practical route to learning succinct, generalizable hypotheses from limited data, narrowing the gap between statistical and computational efficiency in program learning.
