
LLM-Guided Enumerative and Probabilistic Synthesis

Updated 14 February 2026
  • Enumerative and Probabilistic Program Synthesis with LLM Guidance is a paradigm that fuses systematic program enumeration with probabilistic heuristics and pretrained model priors for efficient search.
  • The approach leverages propose-and-verify loops in which LLMs generate candidate programs that are then empirically validated, balancing completeness with computational efficiency.
  • Statistical methods and weighted enumeration via probabilistic grammars underpin the synthesis, ensuring robust performance across formal verification and probabilistic programming tasks.

Enumerative and probabilistic program synthesis with LLM guidance encompasses a range of algorithms that systematically explore program spaces by combining explicit enumeration, probabilistic heuristics, and statistical or syntactic priors provided by pretrained LLMs. This synthesis paradigm addresses the intractability of exhaustive search and the statistical inefficiency of purely neural approaches by leveraging the structural and semantic information encoded in LLMs to bias, prune, and guide search over discrete program representations, while often retaining the formal sample-efficiency and completeness guarantees of classical enumerative schemes.

1. Foundations: Enumerative and Probabilistic Program Synthesis

Enumerative program synthesis is the process of explicitly searching the space of candidate programs—usually defined as strings or trees over a context-free grammar—by systematic enumeration, often prioritized by program length or complexity. Classical enumerative methods guarantee completeness but incur time exponential in the maximum considered program length, rendering them infeasible for complex tasks or expressive languages. Probabilistic program synthesis generalizes this approach by integrating probabilistic models or cost functions over the space of programs, employing stochastic sampling, weighted enumeration, or beam search to more efficiently traverse high-probability or low-cost regions of the search space.
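As a concrete illustration, the following minimal sketch enumerates a toy arithmetic grammar in order of increasing program size and returns the first candidate consistent with the examples; the grammar and examples are illustrative, not from any cited system.

```python
# A minimal enumerative synthesizer over a toy expression grammar:
#   E ::= "x" | "1" | ("+", E, E) | ("*", E, E)
# Programs are enumerated in order of increasing size (number of grammar
# nodes), so the first consistent program found is also a smallest one.

def programs_of_size(n):
    """Yield all expression trees with exactly n grammar nodes."""
    if n == 1:
        yield "x"
        yield "1"
        return
    for op in ("+", "*"):
        for left_size in range(1, n - 1):
            for left in programs_of_size(left_size):
                for right in programs_of_size(n - 1 - left_size):
                    yield (op, left, right)

def evaluate(prog, x):
    if prog == "x":
        return x
    if prog == "1":
        return 1
    op, left, right = prog
    a, b = evaluate(left, x), evaluate(right, x)
    return a + b if op == "+" else a * b

def synthesize(examples, max_size=7):
    """Return a smallest program consistent with all (input, output) pairs."""
    for n in range(1, max_size + 1):
        for prog in programs_of_size(n):
            if all(evaluate(prog, x) == y for x, y in examples):
                return prog
    return None

# Target behavior: f(x) = x * (x + 1)
examples = [(1, 2), (2, 6), (3, 12)]
prog = synthesize(examples)  # a smallest consistent program
```

Note how the cost of each size level grows exponentially: this is exactly the blow-up that the LLM-guided methods in the following sections aim to avoid.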

The core challenge arises from the exponential combinatorial explosion in the number of candidate programs as a function of length or grammar depth, and from the sparsity of correct or consistent solutions within this space. Direct application of enumerative methods is infeasible for large or open-ended grammars, especially in the presence of rich syntactic constraints and semantic generalization requirements.

2. LLM-Guided Search and Propose-and-Verify Schemes

LLM-guided program synthesis addresses exponential search complexity by exploiting pretrained LLMs as distributional priors or oracles over program space. In the LLM-ERM (Empirical Risk Minimization) framework (Singhal et al., 16 Oct 2025), program search is recast as a propose-and-verify loop: the LLM is used to sample $k$ candidate programs, each is empirically validated on training and held-out data, and the best-performing (e.g., lowest validation error) program is returned. This meta-algorithm is formalized as:

  • Input: Sets of training and validation examples, a prompt template, temperature, candidate budget ($k$), and batch size ($b$).
  • Procedure: For $t = 1, \ldots, k$, the LLM is prompted (with in-context examples) to output up to $b$ syntactically valid candidate programs; each candidate is compiled and scored on the provided datasets; the best candidate with validation error below a threshold is selected; otherwise, the process continues until the budget is exhausted.
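The propose-and-verify loop above can be sketched as follows; `llm_propose` and the candidate pool are hypothetical stand-ins for an actual LLM call, so the loop is runnable end to end.

```python
import random

# Sketch of the propose-and-verify loop. `llm_propose` is a hypothetical
# stand-in for an LLM sampling call: here it draws from a fixed pool of
# candidate functions instead of querying a model.

CANDIDATE_POOL = [
    lambda x: x + 1,
    lambda x: 2 * x,
    lambda x: x * x,
    lambda x: x * (x + 1),
]

def llm_propose(train_data, batch_size, rng):
    """Stand-in for sampling `batch_size` candidate programs from an LLM."""
    return [rng.choice(CANDIDATE_POOL) for _ in range(batch_size)]

def error(prog, data):
    return sum(prog(x) != y for x, y in data) / len(data)

def llm_erm(train, val, k=10, b=4, threshold=0.0, seed=0):
    """Propose up to k batches of b candidates, keep the one with lowest
    validation error; stop early once a candidate meets `threshold`."""
    rng = random.Random(seed)
    best, best_err = None, float("inf")
    for _ in range(k):
        for cand in llm_propose(train, b, rng):
            if error(cand, train) > 0:   # discard train-inconsistent candidates
                continue
            e = error(cand, val)
            if e < best_err:
                best, best_err = cand, e
            if best_err <= threshold:
                return best, best_err
    return best, best_err

train = [(1, 2), (2, 6)]
val = [(3, 12), (4, 20)]
prog, err = llm_erm(train, val)
```

There is no gradient feedback anywhere in the loop: the only learning signal is the final ERM selection over the sampled pool.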

This approach replaces uniform enumeration over all strings of length up to $L$ with sampling from the LLM’s prior conditioned on the observed data (i.e., $k$ draws from $\mathrm{LLM}(\cdot \mid \text{train data})$), followed by empirical risk minimization over the candidate set. There is no adaptivity or gradient feedback; search is entirely guided by the static, pretrained LLM prior and ERM selection (Singhal et al., 16 Oct 2025).

3. Statistical Theory and Sample Complexity

LLM-guided enumerative synthesis frameworks inherit favorable sample-complexity bounds from classical finite-class ERM analysis. For any finite candidate set $U$ of size $k \cdot b$, selecting the best candidate via validation yields, with probability at least $1 - \delta$, a generalization error

$$\mathrm{err}_D(u) \leq \frac{\log(kb) + \log(1/\delta)}{|S_{\mathrm{val}}|}.$$

Given sufficient diversity in LLM proposals (i.e., non-trivial probability mass on succinct correct solutions), the total number of required candidate samples and validation points matches, up to logarithmic factors, the traditional Occam-type bound for minimal-length consistent programs:

$$m \geq \frac{2}{\epsilon}\left[\log(kb) + \log(1/\delta)\right],$$

where $m$ is the dataset size, $\epsilon$ is the target error, and $kb$ is the size of the candidate pool (Singhal et al., 16 Oct 2025). Empirically, LLM concentration on short, semantically plausible programs leads to candidate pools that practical search can traverse orders of magnitude faster than classical exhaustive search, while still recovering statistically optimal scaling.
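Both bounds are straightforward to evaluate numerically; the sketch below computes the sufficient validation-set size and the resulting error bound for illustrative parameter values.

```python
import math

# The two bounds above as a small calculator. For a candidate pool of size
# k*b, target error eps, and failure probability delta, `required_samples`
# gives a sufficient validation-set size for the finite-class ERM guarantee.

def required_samples(k, b, eps, delta):
    """m >= (2/eps) * (log(k*b) + log(1/delta))"""
    return math.ceil((2 / eps) * (math.log(k * b) + math.log(1 / delta)))

def generalization_bound(k, b, delta, n_val):
    """High-probability error bound after selecting the best of k*b candidates."""
    return (math.log(k * b) + math.log(1 / delta)) / n_val

# Illustrative setting: 100 batches of 8 candidates, 5% target error,
# 1% failure probability.
m = required_samples(k=100, b=8, eps=0.05, delta=0.01)
bound = generalization_bound(k=100, b=8, delta=0.01, n_val=m)
```

Note that both expressions grow only logarithmically in the pool size $kb$, so even very large candidate pools require modest validation data.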

4. Probabilistic Grammars and Weighted Enumeration

A generalization is to parameterize search-space exploration with probabilistic context-free grammars (PCFGs) distilled from LLM output frequencies. Methods such as HySynth (Barke et al., 2024) and pCFG-synth (Li et al., 2024) learn production-rule probabilities $p(r)$ from raw LLM samples ($N = 100$ per task is typical), smoothing counts to avoid zero probabilities. The induced program-level probability is $p(P) = \prod_{r \in \mathrm{tr}(P)} p(r)$, where $\mathrm{tr}(P)$ is the derivation trace of $P$, enabling bottom-up or A*-style enumeration over cost-weighted grammars that prioritizes high-probability programs. This hybridization achieves:

  • Bias toward LLM-plausible programs: Programs with frequent production traces in LLM samples are assigned lower enumeration cost.
  • Retention of completeness: Probabilistic search remains exhaustive up to a cost cap, preserving symbolic search guarantees as long as smoothing ensures $p(r) > 0$ for all rules.
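A minimal sketch of the PCFG-distillation step, assuming programs are represented by their derivation traces; the grammar, traces, and add-one smoothing constant below are illustrative.

```python
import math
from collections import Counter

# Sketch of PCFG induction from LLM samples: count production-rule
# occurrences across sampled derivation traces, apply add-one (Laplace)
# smoothing so every rule keeps nonzero probability, and score programs by
# the product of rule probabilities (equivalently, a sum of -log costs).

def rule_probabilities(sampled_traces, grammar, alpha=1.0):
    """Smoothed probability for each production rule, per nonterminal."""
    counts = Counter(r for trace in sampled_traces for r in trace)
    probs = {}
    for nt, rules in grammar.items():
        total = sum(counts[r] for r in rules) + alpha * len(rules)
        for r in rules:
            probs[r] = (counts[r] + alpha) / total
    return probs

def program_cost(trace, probs):
    """Negative log-probability of a program's derivation trace."""
    return -sum(math.log(probs[r]) for r in trace)

# Toy grammar with one nonterminal E and four production rules.
grammar = {"E": ["E->x", "E->1", "E->E+E", "E->E*E"]}
# Derivation traces of programs sampled from a (hypothetical) LLM.
samples = [["E->E*E", "E->x", "E->E+E", "E->x", "E->1"],
           ["E->E+E", "E->x", "E->x"],
           ["E->E*E", "E->x", "E->x"]]
probs = rule_probabilities(samples, grammar)
```

Frequent rules receive low enumeration cost, while smoothing keeps every rule reachable, which is what preserves the completeness guarantee noted above.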

The iterative feedback loop for search can be formalized as:

  1. LLM sampling/population of program fragments or full programs.
  2. Extraction of rule frequencies and construction of PCFG.
  3. Weighted enumeration or beam search over the PCFG.
  4. Verification or counterexample-guided refinement.
  5. Optionally, additional LLM queries for corrections or helper functions.

This process is compatible with Counterexample-Guided Inductive Synthesis (CEGIS) (Li et al., 2024).
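Steps 3 and 4 of the loop above can be sketched as a best-first (A*-style) enumeration over rule costs $c(r) = -\log p(r)$; the toy grammar and hand-picked costs below stand in for a distilled PCFG, and all names are illustrative.

```python
import heapq
import itertools

# Best-first enumeration of complete programs in increasing cost order,
# verifying each complete program against the examples until one is
# consistent. Grammar (toy): E -> "x" | "1" | ("+", E, E) | ("*", E, E).

HOLE = "E"
RULES = [  # (cost, template) -- lower cost = more probable under the PCFG
    (1.0, "x"),
    (2.0, "1"),
    (1.5, ("+", HOLE, HOLE)),
    (1.5, ("*", HOLE, HOLE)),
]

def expand(tree, replacement):
    """Replace the leftmost HOLE in `tree`; return (new_tree, replaced?)."""
    if tree == HOLE:
        return replacement, True
    if isinstance(tree, tuple):
        op, left, right = tree
        new_left, done = expand(left, replacement)
        if done:
            return (op, new_left, right), True
        new_right, done = expand(right, replacement)
        return (op, left, new_right), done
    return tree, False

def has_hole(tree):
    return tree == HOLE or (isinstance(tree, tuple) and
                            (has_hole(tree[1]) or has_hole(tree[2])))

def evaluate(tree, x):
    if tree == "x":
        return x
    if tree == "1":
        return 1
    op, l, r = tree
    return evaluate(l, x) + evaluate(r, x) if op == "+" else evaluate(l, x) * evaluate(r, x)

def best_first_synthesize(examples, budget=10000):
    tick = itertools.count()  # tie-breaker so heapq never compares trees
    frontier = [(0.0, next(tick), HOLE)]
    for _ in range(budget):
        if not frontier:
            return None
        cost, _, tree = heapq.heappop(frontier)
        if not has_hole(tree):
            if all(evaluate(tree, x) == y for x, y in examples):
                return tree  # verified: consistent with all examples
            continue
        for rule_cost, template in RULES:
            new_tree, _ = expand(tree, template)
            heapq.heappush(frontier, (cost + rule_cost, next(tick), new_tree))
    return None

prog = best_first_synthesize([(1, 2), (2, 6), (3, 12)])
```

Because every rule has strictly positive cost, the search visits programs in nondecreasing cost order and remains exhaustive up to any cost cap, matching the completeness property described above.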

5. Active Version Space and Transductive Pruning

Transductive program synthesis extends LLM-guided enumeration by using test inputs available at synthesis time for active, query-efficient pruning of the version space (Lee et al., 22 Sep 2025). The algorithm enumerates a finite set $\mathcal{P}'$ of candidate programs (via LLM-guided sampling and filtering by training-data consistency) and defines a finite version space $\mathcal{H}$ of test-output hypotheses. A greedy maximin selection strategy iteratively queries the LLM to predict the output for the test input $x_i$ expected to maximally shrink $\mathcal{H}$ in the worst case. Hypotheses inconsistent with observed outputs are pruned until only the correct hypothesis remains. This framework enables significant reductions in LLM query cost while improving robustness to limited training data (Lee et al., 22 Sep 2025).
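A minimal sketch of the greedy maximin pruning strategy, with a ground-truth function standing in for the LLM's output predictions; all names and the toy candidates are illustrative.

```python
from collections import Counter

# Hypotheses are the distinct test-output vectors of surviving candidate
# programs. Each round picks the test input whose observed answer eliminates
# the most hypotheses in the worst case, queries an oracle (stand-in for an
# LLM output prediction), and prunes inconsistent hypotheses.

def maximin_input(hypotheses, test_inputs):
    """Pick the test-input index whose worst-case surviving set is smallest."""
    best_i, best_worst = None, float("inf")
    for i in range(len(test_inputs)):
        # Group hypotheses by their predicted output on input i; the worst
        # case keeps the largest group.
        groups = Counter(h[i] for h in hypotheses)
        worst = max(groups.values())
        if worst < best_worst:
            best_i, best_worst = i, worst
    return best_i

def prune(hypotheses, test_inputs, oracle):
    """Query until a single hypothesis (test-output vector) remains."""
    queries = 0
    while len(hypotheses) > 1:
        i = maximin_input(hypotheses, test_inputs)
        y = oracle(test_inputs[i])
        queries += 1
        hypotheses = [h for h in hypotheses if h[i] == y]
    return hypotheses[0], queries

# Candidate programs consistent with training data, applied to 3 test inputs:
test_inputs = [4, 5, 6]
candidates = [lambda x: x * (x + 1), lambda x: x * x + x, lambda x: 2 * x + 10]
hypotheses = sorted({tuple(f(x) for x in test_inputs) for f in candidates})
truth = lambda x: x * (x + 1)
answer, n_queries = prune(hypotheses, test_inputs, truth)
```

Semantically equivalent candidates collapse into one hypothesis before querying, which is one source of the query savings.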

6. LLM Guidance in Probabilistic and Bayesian Model Discovery

Probabilistic program synthesis settings, such as probabilistic graphical model induction and PPL code synthesis, further constrain program generation with semantic and statistical validity requirements (Kanda et al., 1 Sep 2025, Curtis et al., 4 May 2025). LLM proposals are filtered by (1) syntactic well-formedness (enforced by grammars and type signatures), (2) semantic constraints (e.g., valid distributional support, parameter signatures), and (3) statistical workflow diagnostics (e.g., split-$\hat{R}$, ESS, PSIS-LOO). Diagnostic failure triggers targeted resampling or refinement: typically LLM-driven repair prompts directed at failed program components or counterexamples.

Search for correct probabilistic programs iterates between candidate LLM-generated proposals, verification against constraints, and diagnostic-driven repair. In POMDP model synthesis (Curtis et al., 4 May 2025), search is additionally guided by Bayesian posterior surrogates over coverage, with Thompson sampling determining candidate refinement priorities.
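The filter-and-repair iteration can be sketched generically as a sequence of gates with targeted repair on failure; the checks and `repair` function below are illustrative placeholders, not any cited system's actual diagnostics.

```python
# Generic sketch: candidates pass through syntactic, semantic, and
# statistical-diagnostic gates; failure at any gate triggers a targeted
# repair step (a stand-in for an LLM repair prompt).

def synthesize_with_diagnostics(propose, checks, repair, max_rounds=5):
    """Return the first candidate passing all checks, repairing on failure."""
    candidate = propose()
    for _ in range(max_rounds):
        failed = next((name for name, check in checks if not check(candidate)), None)
        if failed is None:
            return candidate
        candidate = repair(candidate, failed)  # targeted repair of failed gate
    return None

# Toy model: a "program" is a dict; diagnostics are threshold checks.
checks = [
    ("syntax",      lambda m: "dist" in m),
    ("support",     lambda m: m.get("scale", -1) > 0),    # valid parameter
    ("convergence", lambda m: m.get("rhat", 99) < 1.01),  # split-R-hat gate
]

def repair(model, failed_gate):
    """Stand-in for an LLM repair prompt targeting the failed component."""
    fixes = {"syntax": {"dist": "normal"},
             "support": {"scale": 1.0},
             "convergence": {"rhat": 1.0}}
    return {**model, **fixes[failed_gate]}

model = synthesize_with_diagnostics(lambda: {"rhat": 1.5}, checks, repair)
```

The key structural point is that repair is targeted: only the component behind the failed gate is modified, mirroring the counterexample-directed repair prompts described above.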

7. Empirical Performance and Comparative Analysis

LLM-guided enumerative and probabilistic synthesis algorithms have been extensively evaluated on string transformation (Playgol), Python synthesis (MBPP+), visual reasoning (1D-ARC, ARC), formal SyGuS benchmarks, automata-based tasks (parity, palindrome, Dyck languages), cryptographic and compositional tasks (SHA-256 parity, cellular automata), and probabilistic programming domains (PyMC, POMDPs).

Empirically, LLM-ERM achieves error-free synthesis with 200 examples on tasks where gradient-based transformers (SGD-trained) require 100,000 samples and still overfit (Singhal et al., 16 Oct 2025). PCFG-guided hybrid search (HySynth) outperforms both LLM-only and unguided baseline search, solving 58% of ARC tasks (vs. 2% for LLM-only, 40% for unguided search, and 51% for ARGA) (Barke et al., 2024). In formal synthesis, the union of LLM- and pCFG-guided enumerators outperforms cvc5 (80.1% vs. 68.1%) on SyGuS-Comp tasks (Li et al., 2024). In probabilistic program induction, Thompson-sampling-guided LLM repair achieves coverage and expected return on par with or better than tabular and cloning-based methods in both simulated and real-world POMDPs (Curtis et al., 4 May 2025). Robustness and efficiency are further improved by active hypothesis pruning in transductive frameworks (Lee et al., 22 Sep 2025).

Failure modes remain—LLM outputs can exhibit semantic or syntactic errors, probability surrogates may be polluted by invalid completions, and complex global constraints can escape context-free approximations. Nonetheless, program synthesis guided by probabilistically informed LLM enumeration and selection constitutes a practical, scalable, and sample-efficient alternative to both exhaustive symbolic search and purely neural learning paradigms.
