LLM Programs: Executable Language Model Systems

Updated 4 July 2026

LLM Programs are a design space where LLMs are embedded as executable computational artifacts with explicit state, control flow, and validation procedures.
They incorporate methodologies such as natural language execution, typed declarative embedding, and structured intermediate representations to bridge prompt engineering and programmatic systems.
They enhance performance and reliability by integrating runtime orchestration, self-improvement loops, and robust verification mechanisms to ensure safe and efficient execution.

LLM Programs denotes a family of computational artifacts in which LLMs participate as executable components rather than merely as prompt-completion services. In current research usage, the term spans natural language code executed by an LLM, symbolic programs synthesized or repaired by an LLM, structured intermediate programs used for supervision or search, reasoning procedures executed as inference-time programs, and serving systems that expose inference itself as a programmable substrate (Cheng et al., 16 Dec 2025, Gim et al., 29 Oct 2025, Fu et al., 2024). A common thread is the elevation of prompts, routes, code, or reasoning traces into explicit program objects with state, control flow, interfaces, validation procedures, and runtime policies.

1. Definitions and conceptual scope

Research on LLM Programs uses several closely related definitions. “Natural language programming” treats prompts as executable code—“natural code”—for the LLM to execute, and contrasts this with the conventional view of prompts as isolated text inputs (Cheng et al., 16 Dec 2025). Another line distinguishes symbolic programs, whose deployed behavior is executed by a symbolic interpreter such as Python or regex, from prompt programs, whose deployed behavior is produced by an LLM at runtime; this distinction matters because their performance priors differ sharply (Zheng et al., 15 May 2026). A systems-oriented formulation goes further and argues that modern agentic and reasoning workloads should be treated as programs rather than independent requests, because they maintain state, issue multiple dependent or parallel LLM calls, and expose a tunable compute knob that affects both cost and accuracy (Fu et al., 2024, Luo et al., 19 Feb 2025).

Orientation	Program object	Representative systems
Natural-language execution	Embedded natural code in a host language	Nightjar (Cheng et al., 16 Dec 2025)
Typed declarative embedding	LLM functions inside SQL-like queries	BlendSQL (Glenn et al., 24 Sep 2025)
Reusable synthesized artifact	Deterministic executable program produced offline	TabClean, AlphaOPT (Wang et al., 24 Jun 2026, Kong et al., 21 Oct 2025)
Structured intermediate representation	Action programs or synthesis routes	LEAP, LLM-Syn-Planner (Dessalene et al., 2023, Wang et al., 11 May 2025)
Inference-time reasoning program	SC, Rebase, MCTS, ICoT executions	Dynasor/Certaindex (Fu et al., 2024)
Serving-time inference program	Server-side inference logic with stateful execution	Symphony, Autellix (Gim et al., 29 Oct 2025, Luo et al., 19 Feb 2025)

This variety suggests that “LLM Programs” is not a single formalism. A more precise characterization is that it names a design space in which LLM behavior is embedded into explicit computational structure: host-language interoperability, typed interfaces, executable search procedures, reusable code artifacts, or program-aware serving substrates. A plausible implication is that the term marks a shift away from isolated prompt engineering toward systems in which prompts, generated code, and runtime policies are all treated as programmatic entities.

2. Interfaces, state, and program representations

One influential direction formalizes the boundary between natural and formal code. Nightjar introduces the natural function interface (NFI) with values, effects, and handlers, and instantiates shared program state through effects such as Lookup, Assign, Deref, Ref, Set, and Goto. In Nightjar, natural code can read and write live variables in scope using syntax like <query> and <:response>, mutate existing host objects by reference, and participate in control flow through break, continue, return, and raise; the implementation specializes these abstractions for Python with Eval, Exec, and dedicated control effects (Cheng et al., 16 Dec 2025). This gives LLM-executed code mediated access to scopes, heap objects, and control state rather than forcing all interoperability through serialization and structured output.
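The idea of mediating access to program state through effects and handlers can be sketched in a few lines of Python. This is an illustrative toy and not Nightjar's implementation: the `Lookup` and `Assign` classes, the `handle` function, and the generator standing in for LLM-executed natural code are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Lookup:
    """Effect: read a variable from the live scope (hypothetical class)."""
    name: str


@dataclass
class Assign:
    """Effect: write a variable into the live scope (hypothetical class)."""
    name: str
    value: Any


def handle(effects, scope):
    """Drive a generator of effects against a scope dictionary.

    Each yielded effect is interpreted by the handler, so the "natural code"
    never touches the scope directly.
    """
    try:
        effect = next(effects)
        while True:
            if isinstance(effect, Lookup):
                result = scope[effect.name]
            elif isinstance(effect, Assign):
                scope[effect.name] = effect.value
                result = None
            else:
                raise TypeError(f"unsupported effect: {effect!r}")
            effect = effects.send(result)
    except StopIteration as stop:
        return stop.value


def fake_natural_code():
    # Stand-in for an LLM executing "set total to the sum of prices".
    prices = yield Lookup("prices")
    yield Assign("total", sum(prices))
    return "done"


scope = {"prices": [3, 4, 5]}
status = handle(fake_natural_code(), scope)
print(scope["total"], status)
```

Because every read and write passes through `handle`, a host could log, restrict, or reject individual effects without changing the code that issues them.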

A related but more declarative formulation appears in BlendSQL. Here, LLM operators such as llmqa and llmmap are embedded into a SQL-like host language, and the system infers output constraints from expression context before and during generation. The paper presents examples such as $f() > 40 \Rightarrow f() \rightarrow \text{int}$, city = f() implying $f() \rightarrow \text{Literal}[\{\text{'Washington DC'}, \text{'San Jose'}\}]$, and team IN f() implying a list of literals derived from database contents. Constrained decoding then enforces well-typedness and database alignment instead of delegating normalization to multi-step post-processing (Glenn et al., 24 Sep 2025). This makes typing a first-class part of LLM program execution.
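A minimal sketch of context-driven constraint inference, assuming a much simplified setting in which the LLM call appears on one side of a single comparison. The function `infer_constraint` and its string-based type descriptions are hypothetical and are not part of BlendSQL's API.

```python
from typing import Optional


def infer_constraint(operator: str, other_operand=None,
                     column_values: Optional[list] = None) -> str:
    """Guess an output constraint for an LLM call f() from its expression context."""
    is_number = isinstance(other_operand, (int, float)) and not isinstance(other_operand, bool)
    if operator in {">", "<", ">=", "<="} and is_number:
        # f() > 40 implies f() should produce the same numeric type as 40.
        return type(other_operand).__name__
    if operator == "=" and column_values:
        # city = f() implies f() should produce one of the values in the column.
        return f"Literal{sorted(set(column_values))}"
    if operator == "IN" and column_values:
        # team IN f() implies f() should produce a list of known values.
        return f"List[Literal{sorted(set(column_values))}]"
    return "str"  # no usable context: fall back to free text


print(infer_constraint(">", 40))
print(infer_constraint("=", column_values=["Washington DC", "San Jose"]))
```

In a real system the inferred constraint would steer constrained decoding; here it is only returned as a descriptive string.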

Many LLM Programs also rely on structured intermediate representations rather than raw prose. LEAP defines an action program $p_i$ as a structure of discrete tokens in which each token is either a conditional statement such as (while, <condition>) or (if, <condition>), or a sub-action statement (v, o) with $v \in \{\emptyset, \text{Grasp}, \text{Release}, \text{Move}, \text{Use}, \text{Position}, \text{Goto}, \text{Wait}\}$; the representation encodes sub-actions, preconditions, postconditions, and control flow for egocentric actions (Dessalene et al., 2023). TabClean synthesizes an ordered sequence of guarded repair clauses,

$P=\langle c_1,\ldots,c_k\rangle,\qquad c_\ell=(A_\ell,s_\ell,g_\ell,f_\ell,e_\ell,\pi_\ell),$

where the target column, repair skill, guard predicate, deterministic transformation, supporting evidence, and priority are all explicit (Wang et al., 24 Jun 2026). In route planning for chemistry, LLM-Syn-Planner linearizes an overview tree into a sequence of step dictionaries containing 'Molecule set', 'Rational', 'Product', 'Reaction', 'Reactants', and 'Updated molecule set', which makes the route easy to parse, verify, and mutate (Wang et al., 11 May 2025). These representations differ, but each turns LLM output into a machine-manipulable program object.
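To make the clause tuple concrete, the following sketch encodes the six components as a Python dataclass and applies clauses in priority order. The field names, the two example clauses, and the convention that lower priority values run first are assumptions for illustration, not TabClean's actual code.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Clause:
    column: str                      # A: target column
    skill: str                       # s: repair skill label
    guard: Callable[[dict], bool]    # g: predicate deciding whether the clause fires
    transform: Callable[[str], str]  # f: deterministic transformation
    evidence: str                    # e: supporting evidence
    priority: int                    # pi: ordering key (assumed: lower runs first)


def apply_program(program, row):
    """Apply every clause whose guard holds, in priority order, to a copy of the row."""
    repaired = dict(row)
    for clause in sorted(program, key=lambda c: c.priority):
        if clause.guard(repaired):
            repaired[clause.column] = clause.transform(repaired[clause.column])
    return repaired


program = [
    Clause("zip", "pad_zip", lambda r: len(r["zip"]) < 5,
           lambda z: z.zfill(5), "ZIP codes have five digits", 2),
    Clause("state", "normalize_case", lambda r: r["state"].islower(),
           str.upper, "other rows use uppercase codes", 1),
]

cleaned = apply_program(program, {"state": "ca", "zip": "9021"})
print(cleaned)
```

Once such a program is fixed, applying it to new rows involves no model calls, which is the property the reusable-artifact view emphasizes.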

3. Synthesis, search, and self-improvement

A large subclass of LLM Programs uses the LLM as a proposer inside a larger search procedure. ALGO targets algorithmic programming problems by decomposing the judge as

$J(P)=J_T(P)\land J_S(P),$

then asking an LLM to generate a brute-force oracle that satisfies the semantic part $J_S$ on small inputs. That oracle becomes a verifier for search over efficient candidate programs, improving one-submission pass rate by $8\times$ over Codex and $2.6\times$ over CodeT on CodeContests, while oracle semantic correctness reached $88.5\%$ on LeetCode; the corresponding CodeContests figure is reported in the same paper (Zhang et al., 2023). The key move is to externalize a reference semantics into an executable artifact rather than relying only on direct code generation.
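The oracle-as-verifier pattern can be illustrated on a small problem. In this sketch the maximum-subarray task, the brute-force oracle, the Kadane-style candidate, and the random input generator are all chosen for illustration; they are not taken from ALGO.

```python
import random


def oracle_max_subarray(xs):
    """Brute-force reference: try every non-empty contiguous slice."""
    n = len(xs)
    return max(sum(xs[i:j]) for i in range(n) for j in range(i + 1, n + 1))


def candidate_max_subarray(xs):
    """Efficient candidate (Kadane's algorithm) to be checked against the oracle."""
    best = current = xs[0]
    for x in xs[1:]:
        current = max(x, current + x)
        best = max(best, current)
    return best


def verify(candidate, oracle, trials=200, seed=0):
    """Return a counterexample input, or None if all small random inputs agree."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.randint(-10, 10) for _ in range(rng.randint(1, 8))]
        if candidate(xs) != oracle(xs):
            return xs
    return None


print(verify(candidate_max_subarray, oracle_max_subarray))
```

The oracle is slow but easy to get right, so agreement on many small inputs gives evidence about the semantic part of the judge while the efficient candidate addresses the time limit.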

TabClean makes a different synthesis move. Instead of invoking an LLM on each row or cell of a dirty table, it performs a finite-state workflow (Diagnose, Strategize, Code, Decide, Review, Commit) to compile LLM reasoning into a reusable guarded Python cleaning program. Once the best validated program is committed, later schema-compatible batches require no LLM inference. Across six benchmarks, TabClean achieves the best F1 on five of six datasets. On the 200,000-row tax table it reduces runtime from hours for Baran to minutes and uses only a fraction of IterClean’s API cost; the exact precision, recall, F1, runtime, and cost figures are given in the paper (Wang et al., 24 Jun 2026). This is a particularly clear instance of an LLM Program as a compiled reusable artifact.

AlphaOPT applies the same principle to optimization modeling. It maintains a self-improving experience library of structured insight entries and alternates between Library Learning and Library Evolution. The paper states a formal objective for the library and a separate score for the refinement quality of each insight. Without parameter updates or annotated reasoning traces, AlphaOPT's reported performance improves as training items increase from 100 to 300, and it surpasses the strongest baseline on out-of-distribution OptiBench when trained only on answers (Kong et al., 21 Oct 2025). Here the “program” is not only the generated solver code, but the continual synthesis system that stores and revises explicit modeling knowledge.

Test-time search can also be performed directly over generated programs. “Probabilistic Programs of Thought” compiles one generated program plus next-token probabilities into a probabilistic program that represents many nearby deterministic programs, focusing on single-token components such as digits and operators. Because sampling from the compiled probabilistic program requires no additional GPU generations, the method improves GSM8K, CRUXEval, and Plot2Code with little CPU overhead and almost no extra wall-clock time relative to ordinary LLM inference (Garg et al., 19 Apr 2026). In chemistry, LLM-Syn-Planner uses the LLM to generate and mutate full retrosynthesis routes rather than isolated single-step reactions; on hard route-planning benchmarks, its solve rates are far higher than using the same LLM as a local single-step policy inside MCTS or Retro* (Wang et al., 11 May 2025). These results suggest that LLMs may be more effective as trajectory-level program proposers than as purely local action predictors.

4. Validation, verification, and secure execution

Because LLM Programs often lack a practical ground-truth oracle, validation is a central theme. “Metamorphic prompt testing” addresses this by lifting metamorphic testing from the program domain to the prompt domain: if semantically equivalent prompts generate semantically equivalent programs, disagreements across paraphrase-derived implementations on generated tests are evidence of error. On HumanEval with GPT-4, the method is evaluated by the share of erroneous target programs it detects and by its false positive rate, with the best results reported under a 5-prompt configuration (Wang et al., 2024). Its epistemic stance is asymmetric: disagreement is evidence of likely error, while agreement is not proof of correctness.
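A minimal sketch of the cross-checking step, assuming the paraphrase-derived programs have already been generated. The functions below are hand-written stand-ins for LLM outputs, and the majority-disagreement threshold is an assumption rather than the paper's exact decision rule.

```python
def flag_by_disagreement(target, variants, test_inputs):
    """Return the inputs on which the target disagrees with most variants."""
    suspicious = []
    for x in test_inputs:
        expected = target(x)
        disagreeing = sum(1 for variant in variants if variant(x) != expected)
        if disagreeing > len(variants) / 2:
            suspicious.append(x)
    return suspicious


def target_is_even(n):
    # Hypothetical buggy target program: wrong for zero and negative numbers.
    return n % 2 == 0 and n > 0


def variant_a(n):
    return n % 2 == 0


def variant_b(n):
    return not n & 1


def variant_c(n):
    return n % 2 != 1


flagged = flag_by_disagreement(target_is_even, [variant_a, variant_b, variant_c], [-2, 0, 3, 4])
print(flagged)
```

An empty result would not show the target is correct, since all implementations could share the same mistake; a non-empty result only marks inputs worth inspecting.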

A stronger correctness target appears in SynVer, which synthesizes CompCert C with an LLM and checks it in a foundational Separation Logic setting using the Verified Software Toolchain. SynVer imposes verification-friendly biases (no novel helper functions, and recursion instead of loops), then uses SepAuto to automate proof obligations generated from Hoare triples of the form $\{P\}\ c\ \{Q\}$. Across benchmarks with basic examples, Separation Logic assertions, and API specifications, most programs are synthesized correctly on the first attempt, and cross-verifier evaluation shows particularly strong results for SepAuto on the SL and API categories (Mukherjee et al., 2024). This suggests that LLM program synthesis becomes materially easier to certify when code generation is biased toward proof-friendly fragments.

Security-focused work treats execution itself as the problem. STELP inserts a trusted enforcement layer between LLM-generated code and the runtime: code is parsed into an AST, validated against a configurable safe grammar subset, transpiled into a secured execution path, and executed with runtime safety controls and tool-call mediation. STELP is evaluated by True Block Rate and True Allow Rate on InjectedHumanEval and by correctness on a benign execution benchmark, and it reports the share of blocked samples that were successfully repaired in under 2 retries with Llama 3.3-70B in the feedback loop (Shinde et al., 9 Jan 2026). ReFuzzer applies a related validator-centered pattern to compiler fuzzing: by feeding compiler and sanitizer diagnostics back to a local LLM, it raises the validity of generated C test programs on GPU-based settings and increases coverage in optimization components such as vectorization, inlining, and dead code elimination (Shree et al., 5 Aug 2025).
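The validation step of such an enforcement layer can be sketched with Python's standard `ast` module. The allowlists below are arbitrary examples, the function name `violations` is hypothetical, and the sketch covers only static validation, not transpilation, sandboxing, or tool-call mediation.

```python
import ast

# Assumed safe subset: simple assignments, arithmetic, and a few builtin calls.
ALLOWED_NODES = (
    ast.Module, ast.Expr, ast.Assign, ast.Name, ast.Load, ast.Store,
    ast.Constant, ast.BinOp, ast.Add, ast.Sub, ast.Mult, ast.Call,
)
ALLOWED_CALLS = {"len", "sum", "min", "max"}


def violations(source: str) -> list:
    """List every construct in the source that falls outside the allowed subset."""
    problems = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ALLOWED_NODES):
            problems.append(type(node).__name__)
        elif isinstance(node, ast.Call):
            func = node.func
            if not (isinstance(func, ast.Name) and func.id in ALLOWED_CALLS):
                problems.append("Call to non-allowlisted function")
    return problems


print(violations("total = sum(values) + 1"))
print(violations("import os\nos.system('echo hi')"))
```

The list of violations could be fed back to the generating model as repair feedback, in the spirit of the retry loop described above.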

Mutation-driven repair and semantic linting extend these validation ideas to quantum software. In automated repair for Qiskit programs, mutation analysis results (line number, mutation operator, traceback, and status) supply richer contextual evidence than basic runtime failures, and the paper reports the repair success rate of the best prompt configuration on the selected Bugs4Q subset (Yoshida et al., 18 Jan 2026). In linting, LintQ-LLM+CoT and LintQ-LLM+RAG outperform the rule-based LintQ in F1-score on a corpus of 55 Qiskit programs, and the RAG variant slightly improves precision by reducing false positives (Cassieri et al., 5 May 2026). Together these systems make clear that LLM Programs are rarely deployed as unchecked one-shot generators; they are embedded in evaluation loops, verifiers, sanitizers, and repair pipelines.

5. Runtime systems and inference orchestration

Another line of work treats the inference process itself as a programmable runtime. Symphony argues that current serving systems should serve programs, not prompts, and introduces LLM Inference Programs (LIPs) as server-side logic that can customize token prediction, manage KV cache, and integrate tool execution. The interface centers on a single system call, while KV state is virtualized through KVFS, a file-system abstraction over paged key-value cache. With two-level scheduling and application-controlled KV-cache policies, the paper reports higher throughput than existing systems such as vLLM in the highlighted simulation setting (Gim et al., 29 Oct 2025). The significance is architectural: generation loops, cache reuse, and tool interactions become programmable server-side behavior.

Autellix adopts a similar stance for agentic workloads. It models an agentic program as a dynamic DAG of LLM calls and external interrupts, then schedules at the program level rather than the individual-request level. For single-threaded programs it uses PLAS, where a call’s priority is the cumulative runtime of prior completed calls in the same program; for distributed DAGs it uses ATLAS, which approximates critical-path progress. Across diverse workloads and LLMs, Autellix improves program throughput over vLLM at the same latency (Luo et al., 19 Feb 2025). This result depends on the claim that agentic LLM workloads are better understood as general programs with dependency structure, not as independent prompts.
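Program-level prioritization can be sketched with a priority queue keyed on each program's cumulative runtime. The class below is a toy, not Autellix's scheduler: the class and method names are hypothetical, it ignores preemption and batching, and it assumes that programs with less attained runtime are served first.

```python
import heapq
from collections import defaultdict


class ProgramLevelScheduler:
    def __init__(self):
        # Program id -> cumulative runtime of its completed calls.
        self.attained = defaultdict(float)
        self.queue = []
        self.counter = 0  # tie-breaker: FIFO among calls with equal priority

    def submit(self, program_id, call_id):
        """Queue a call with priority equal to its program's attained runtime."""
        priority = self.attained[program_id]
        heapq.heappush(self.queue, (priority, self.counter, program_id, call_id))
        self.counter += 1

    def next_call(self):
        """Pop the call whose program has consumed the least runtime so far."""
        _, _, program_id, call_id = heapq.heappop(self.queue)
        return program_id, call_id

    def complete(self, program_id, runtime):
        """Record the runtime of a finished call against its program."""
        self.attained[program_id] += runtime


sched = ProgramLevelScheduler()
sched.complete("A", 5.0)   # program A has already used 5 seconds
sched.submit("A", "a2")
sched.submit("B", "b1")    # program B is new
order = [sched.next_call(), sched.next_call()]
print(order)
```

A request-level scheduler would treat `a2` and `b1` identically; keying on program history is what lets a short new program overtake a long-running one.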

Dynasor pushes the same idea into test-time reasoning. It treats self-consistency, Rebase, MCTS, and internalized chain-of-thought as reasoning programs with internal state, multiple dependent or parallel calls, and a compute knob. Its progress signal, certaindex, is based on answer stabilization: it uses the semantic entropy of the answers sampled so far, or reward-model scores, to decide whether further compute is likely to change the final answer. Integrated into serving, this yields compute savings and higher throughput with no accuracy drop in the reported workloads, and supports tighter latency SLOs (Fu et al., 2024). The implication is that reasoning-time control policies belong inside the runtime of an LLM Program, not only in outer orchestration code.
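An entropy-based stopping rule of this kind can be sketched as follows. The sketch groups answers by exact string match as a stand-in for semantic clustering, and the normalization and the `threshold` value are assumptions rather than Dynasor's exact definition of certaindex.

```python
import math
from collections import Counter


def answer_entropy(answers):
    """Empirical entropy of the answer distribution, in nats."""
    n = len(answers)
    counts = Counter(answers)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def certainty(answers):
    """Normalized score: 1.0 when all samples agree, 0.0 when all differ."""
    n = len(answers)
    if n < 2:
        return 0.0  # too few samples to judge stability
    return 1.0 - answer_entropy(answers) / math.log(n)


def should_stop(answers, threshold=0.7):
    """Stop spending compute once the sampled answers have stabilized."""
    return certainty(answers) >= threshold


print(should_stop(["42", "42", "42", "42"]))
print(should_stop(["42", "42", "42", "41"]))
```

A serving system could call `should_stop` after each batch of samples and release the remaining compute budget to other requests when it returns true.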

Probabilistic Programs of Thought fits this runtime perspective at a smaller scale. By exposing the model’s uncertainty inside generated code and turning selected token positions into categorical random variables, it allows repeated CPU-side sampling of new candidate programs without new GPU generations (Garg et al., 19 Apr 2026). This is a different kind of runtime control, but it shares the same principle: inference is itself a manipulable program object.
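The idea of resampling nearby programs from stored token probabilities can be sketched as below. The slot structure, the example probabilities, and the use of `exec` on a one-line arithmetic program are assumptions for illustration and do not reproduce the paper's compilation scheme.

```python
import random
from collections import Counter

# Each slot lists (token, probability) alternatives; fixed text has one alternative.
# The probabilities are made-up stand-ins for next-token probabilities.
slots = [
    [("result = ", 1.0)],
    [("12", 0.7), ("21", 0.3)],
    [(" * ", 0.6), (" + ", 0.4)],
    [("3", 1.0)],
]


def sample_program(slots, rng):
    """Draw one concrete program by sampling every uncertain slot independently."""
    parts = []
    for alternatives in slots:
        tokens = [token for token, _ in alternatives]
        weights = [prob for _, prob in alternatives]
        parts.append(rng.choices(tokens, weights=weights, k=1)[0])
    return "".join(parts)


def run(program):
    """Execute a sampled program; a toy setup, not a security boundary."""
    env = {}
    exec(program, {"__builtins__": {}}, env)
    return env["result"]


rng = random.Random(0)
outputs = [run(sample_program(slots, rng)) for _ in range(100)]
votes = Counter(outputs)
print(votes.most_common())
```

All one hundred samples are produced on the CPU from one stored generation, and the vote counts give a rough picture of how sensitive the final answer is to the uncertain tokens.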

6. Applications, calibration, and open issues

The application range of LLM Programs is broad. LEAP uses GPT-4 to generate video-grounded action programs for egocentric clips in EPIC-Kitchens, representing actions through sub-actions, conditions, and control flow; it generates programs for 58,000 of 67,217 training clips, roughly $86\%$ of the training set, and using these programs as auxiliary supervision improves both action recognition and action anticipation in the reported setup (Dessalene et al., 2023). In chemistry, LLM-Syn-Planner treats retrosynthesis as route-level decision programming and extends naturally to synthesizable molecular design under a constrained objective, thereby using LLMs to propose and edit synthesis programs rather than merely suggest local reactions (Wang et al., 11 May 2025). In declarative data systems, typed LLM operators inside BlendSQL improve denotation accuracy in the strongest reported HybridQA execution setting while reducing latency relative to a comparable declarative baseline on TAG-Bench (Glenn et al., 24 Sep 2025).

Performance calibration is itself becoming a programmatic concern. “Predicting Performance of Symbolic and Prompt Programs with Examples” models each pass/fail execution as a Bernoulli variable with latent success probability $\theta$. Given a prior $p(\theta)$ and $k$ passes observed in $n$ executions, the posterior is

$p(\theta \mid k, n) \propto p(\theta)\,\theta^{k}(1-\theta)^{n-k}.$

Its empirical finding is that symbolic programs have an “all-or-nothing” prior, while prompt programs have a diffuse prior with many nearly-correct artifacts, which explains why a few tests can certify symbolic code but not prompt programs. RAP then constructs a task-specific prior by retrieving similar tasks and prompt programs and fitting a mixture of Beta distributions (Zheng et al., 15 May 2026). This suggests that reliability estimation for LLM Programs is itself an inference problem over program classes, not merely a matter of counting passed examples.
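The effect of the prior on what a few passing tests imply can be seen in a conjugate Beta-Bernoulli update. The two priors below are illustrative choices: a U-shaped Beta(0.1, 0.1) as a rough stand-in for an all-or-nothing prior and a uniform Beta(1, 1) as a diffuse one. Neither is a fitted prior from the paper, and RAP's mixture of Beta distributions is not reproduced here.

```python
def beta_posterior(alpha, beta, passes, fails):
    """Conjugate update: Beta(alpha, beta) prior plus Bernoulli pass/fail outcomes."""
    return alpha + passes, beta + fails


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Three passing tests and no failures under each prior.
all_or_nothing = beta_posterior(0.1, 0.1, passes=3, fails=0)
diffuse = beta_posterior(1.0, 1.0, passes=3, fails=0)

print(beta_mean(*all_or_nothing))  # about 0.97
print(beta_mean(*diffuse))         # 0.8
```

With the same three passes, the U-shaped prior moves the expected success probability much closer to one than the uniform prior does, which matches the qualitative claim that a few tests are more informative for one program class than for the other.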

The open issues are correspondingly diverse. Shared-state systems report improved expressiveness and reduced lines of code, but Nightjar also reports runtime overheads relative to manual implementations and explicitly notes that safety and security remain underdeveloped (Cheng et al., 16 Dec 2025). Metamorphic prompt testing depends on paraphrase quality and only detects inconsistencies exposed by generated tests (Wang et al., 2024). STELP still leaves open questions about host-level isolation and broader repository-scale behavior (Shinde et al., 9 Jan 2026). AlphaOPT identifies retrieval misalignment as a persistent problem for experience libraries, and its refinement machinery exists precisely because a correct modeling rule can be retrieved in the wrong context (Kong et al., 21 Oct 2025). A plausible synthesis is that the next phase of LLM Programs will depend less on larger standalone models than on better interfaces, richer validators, stronger runtime substrates, and more explicit representations of when an LLM-generated program should be trusted.