Program Synthesis with LLMs

Updated 8 July 2025

Program synthesis with large language models is the process of automatically generating functional code from natural language specifications using transformer-based architectures.
It integrates advanced machine learning methods with classic synthesis and verification techniques, employing benchmarks like MBPP and MathQA-Python to evaluate performance.
Interactive refinement via prompt engineering and human feedback further improves accuracy, addressing challenges in semantic understanding and overfitting.

Program synthesis with LLMs refers to the use of large-scale transformer-based neural networks to generate computer programs from high-level specifications, most often in the form of natural language descriptions or input–output examples. The advent of LLMs trained on mixed code and text corpora has resulted in substantial advances in the automatic synthesis of functional code. This area is characterized by the integration of powerful machine learning methods with classic program synthesis and verification techniques, the development of rigorous benchmarks, and the careful paper of performance, trustworthiness, and the semantic abilities of such models.

1. Benchmarks and Evaluation Methodologies

A foundational aspect of progress in program synthesis with LLMs is the creation of large, systematically designed benchmarks that allow quantitative evaluation across diverse tasks and models. Two central benchmarks in this domain are the Mostly Basic Programming Problems (MBPP) dataset, consisting of 974 short Python programming tasks aimed at entry-level programmers, and MathQA-Python, a Python variant of the MathQA benchmark containing 23,914 mathematically posed problems requiring code generation under stricter correctness conditions (2108.07732).

Performance on these benchmarks is assessed through two primary measures:

Fraction of problems solved by any sample: For each problem, multiple completions are generated and the metric records if at least one solution passes provided unit tests.
Fraction of samples solving the task: This aggregates the proportion of all generated samples that are correct, reflecting model reliability.

A key result is the log-linear scaling law: synthesis performance, measured as the fraction of problems solved, grows approximately as a function of $\log(N)$ , where $N$ is the number of model parameters.

2. Prompt Engineering, Few-Shot Learning, and Fine-tuning Regimes

LLMs for program synthesis exhibit remarkable capabilities in both few-shot and fine-tuning regimes. In the few-shot setting, prompts consist of a small number of example tasks, each presented with a function signature, a natural language task description, and representative test cases. Empirical studies have demonstrated that the largest models—without any specific fine-tuning—can solve approximately 59.6% of MBPP problems in this regime, provided multiple completions are sampled and carefully crafted prompts are utilized.

Prompt design is critical: the precise choice of examples within the few-shot prompt and ensemble methods across different prompts can further improve performance. Contrastingly, fine-tuning the model on a held-out subset of the benchmark results in a significant improvement, adding roughly 10 percentage points in accuracy across model sizes and benchmarks. For instance, the largest fine-tuned model on MathQA-Python achieved 83.8% accuracy.

Prompt templates formalize the approach, ensuring consistent structure for conditioning. A representative format is:

Define the function {function_name} that {describes task}.
For example:
assert {function_name}({input1}) == {output1}
assert {function_name}({input2}) == {output2}
Write your solution below:

In addition to autonomous code generation, LLMs have been shown to benefit significantly from interactive settings where humans provide natural language corrections or hints. Experiments demonstrate that a single turn of natural language feedback can halve the error rate compared to the model's initial prediction; with up to four conversational turns, solve rates exceeded 65% for interactive tasks.

A detailed error analysis reveals that the nature of typical errors shifts with increasing model size. Smaller models are more prone to syntactic and type errors, while larger models exhibit a reduction in such errors and a corresponding increase in semantically incorrect (but syntactically valid) code. A notable but rare phenomenon is overfitting to prompt assertions—where the model outputs code tailored to pass included tests without solving the underlying problem, highlighting subtle challenges in test-driven code synthesis.

4. Semantic Grounding and Execution Understanding

A critical avenue explored is whether LLMs possess semantic grounding—the ability to reason about and predict the behavior of code on given inputs. Direct experiments show that even top-performing models achieve limited success (e.g., around or below 29% accuracy) at predicting execution outputs, indicating that while these models are effective at generating code that matches patterns seen in training examples, they lack deep symbolic reasoning or an internal simulation of execution required for robust semantic understanding.

This gap highlights the difference between models excelling at outputting code that "looks correct" or passes simple tests and those capable of robust, semantically verified synthesis.

5. Technical Features and Scaling

LLMs used for program synthesis generally follow the left-to-right, autoregressive Transformer decoder architecture introduced by Vaswani et al. (2017). Model sizes in leading studies range from hundreds of millions to over a hundred billion parameters, with pre-training performed on massive code–text corpora using subword vocabularies (e.g., 32k-token SentencePiece). Empirical studies confirm the log-linear relation between model size and accuracy, expressed as $P(N) \approx a \log(N) + b$ , where $P(N)$ is the synthesis performance and $a, b$ are dataset-dependent constants.

Fine-tuning details indicate that even modest data (e.g., 100 steps on MBPP with a learning rate of $3 \times 10^{-5}$ ) can yield significant improvements, underscoring the value of domain-specific adaptation.

6. Implications, Limitations, and Future Directions

Current findings establish that LLMs can synthesize correct and readable Python programs for a wide range of entry-level problems and that performance reliably improves with model size, prompt sophistication, and data quantity. Yet, limitations remain—chiefly in semantic understanding, robustness across problem types, and subtle forms of overfitting to test assertions. The inability to simulate execution and reason about dynamic behavior points to a need for either architectural or algorithmic advances.

Interactive synthesis using conversational feedback, improved prompt ensembling strategies, and hybrid neurosymbolic methods that combine statistical code generation with classic program analysis or execution-based evaluation represent promising directions. Efforts to benchmark and probe for semantic reasoning, as well as to bridge the gap between pattern-matching and true program understanding, are ongoing priorities for future research in this field.

This synthesis of program synthesis with LLMs is derived from key findings and experimental results in (2108.07732), integrating their methodological, quantitative, and technical contributions as well as the factors currently limiting deployment and semantic robustness of LLM-driven code generation systems.

PDF Markdown Chat (Upgrade)

References (1)

Program Synthesis with Large Language Models (2021)