Probabilistic Programs of Thought

Updated 4 July 2026

Probabilistic programs of thought are executable probabilistic code that represents beliefs, plans, and hypotheses as distributions, unifying computation with uncertainty.
They integrate language with structured generative models, converting natural language into world models for commonsense inference and diverse planning tasks.
Multiple approaches—including nested inference, exact automata methods, and program learning—demonstrate practical trade-offs in efficiency, precision, and scalability.

Probabilistic programs of thought are computational formulations in which beliefs, plans, hypotheses, or candidate solutions are represented as executable probabilistic code, and reasoning is performed by inference over the distributions those programs define. Across recent work, the phrase has been used in multiple closely related senses: as a model of nested theory-of-mind reasoning in autonomous agents (Seaman et al., 2018), as a language-mediated “probabilistic language of thought” for world modeling and commonsense inference (Wong et al., 2023), as inference over program text itself (Perov et al., 2014), as exact posterior computation in restricted probabilistic languages via weighted automata (Geißler et al., 15 Dec 2025), as a source of posterior supervision for uncertainty-aware LLM training (Zhang et al., 26 May 2026), and as a test-time decoding method that exposes an LLM’s token distribution inside generated programs (Garg et al., 19 Apr 2026). In all of these uses, the central move is to treat structured computation not merely as output format but as the substrate on which uncertainty is represented and manipulated.

1. Conceptual scope and defining commitments

A common commitment across this literature is that a probabilistic program is simultaneously a generative model and an executable procedure. In "Learning Probabilistic Programs" (Perov et al., 2014), models are “samplers represented as program text,” and program text itself is treated as a random variable subject to Bayesian inference. In "From Word Models to World Models" (Wong et al., 2023), linguistic meaning is formalized as a context-sensitive mapping from natural language to a distribution over expressions in a probabilistic language of thought, after which inference over the translated program supports reasoning. In "Probabilistic Programs of Thought" (Garg et al., 19 Apr 2026), code generation by an autoregressive LLM is itself formalized as a probabilistic program over token sequences, parsing, and execution.

This shared perspective departs from viewing reasoning as either pure symbolic deduction or direct pattern matching over final answers. Instead, the intermediate object is a stochastic procedure: a world model, a planner, a sampler, or a distribution over nearby code alternatives. The resulting inference target may be a posterior belief state, a plan conditioned on expected utility, a learned sampler that reproduces observed data, or an execution-verified program candidate.

A recurrent distinction is between verbalized reasoning and executable reasoning. Program-based Posterior Training explicitly contrasts its approach with ordinary chain-of-thought by making the latent-variable model and its posterior the supervision target rather than merely verbalizing intermediate steps (Zhang et al., 26 May 2026). A closely related distinction appears in rational meaning construction, where the LLM translates language into code, while the probabilistic program performs the actual inference (Wong et al., 2023). This suggests that, within this family of work, “thought” is not identified with text alone but with structured generative computation.

2. Representational substrates and semantic formalisms

The representational substrate varies across papers, but each substrate is designed to preserve both compositional structure and uncertainty.

Substrate	Function	Representative source
Church	World modeling with `condition`, `query`, and `define`	(Wong et al., 2023)
Anglican	Inference over program text and probabilistic program compilation	(Perov et al., 2014)
Probability generating automata	Exact inference for a restricted class of discrete imperative probabilistic programs	(Geißler et al., 15 Dec 2025)
Pyro	Translation target for scenario-based posterior supervision	(Zhang et al., 26 May 2026)
Token-level compiled probabilistic programs	Local resampling around an LLM-generated program	(Garg et al., 19 Apr 2026)

The most explicit semantic development is the weighted-automata account of exact inference. "Probabilistic Programming Meets Automata Theory: Exact Inference using Weighted Automata" defines a weighted automaton as

$A = (Q, M, I, F),$

with finite state set $Q$, transition matrix $M$, initial-weight vector $I$, and final-weight vector $F$, and gives its semantics as

$\llbracket A \rrbracket = I M^* F.$

The paper instantiates this in the semiring of formal power series with nonnegative real coefficients, using program variables as indeterminates, and restricts transition labels to either $r$ or $r \cdot X$ for nonnegative real $r$ and program variable $X$ (Geißler et al., 15 Dec 2025). A weighted automaton of this kind is a probability generating automaton when its semantics is a probability generating function whose coefficients sum to at most $1$.

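This semantic choice matters because it supports infinite-support distributions with finite-state automata. For instance, a single state with initial weight $1$, final weight $p$, and a self-loop labeled $(1-p) \cdot X$ yields

$\llbracket A \rrbracket = 1 \cdot \big((1-p)X\big)^* \cdot p = \sum_{n \ge 0} p\,(1-p)^n X^n = \frac{p}{1 - (1-p)X},$

which is the PGF of a geometric distribution with parameter $p$ (Geißler et al., 15 Dec 2025). The automaton remains finite-state, while loops encode the infinite series.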
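The way a finite automaton with a loop encodes an infinite series can be checked numerically. The sketch below is an illustration, not the paper's implementation: it expands $I M^* F$ path by path for a single indeterminate $X$, truncated at a maximum degree. The function name `pgf_coefficients`, the tuple encoding of transitions, and the parameter value $p = 1/3$ are all assumptions made for the example.

```python
from fractions import Fraction


def pgf_coefficients(initial, transitions, final, max_steps, max_degree):
    """Truncated coefficients of I M^* F in one indeterminate X.

    initial, final: {state: weight}
    transitions: list of (src, dst, weight, x_power), where x_power is
    0 for a label r and 1 for a label r * X.
    """
    coeffs = [Fraction(0)] * (max_degree + 1)
    # current[state][degree] = total weight of length-n paths ending in state
    current = {s: {0: Fraction(w)} for s, w in initial.items()}
    for _ in range(max_steps + 1):
        # Add the contribution I M^n F of the current path length n.
        for state, poly in current.items():
            f = final.get(state, 0)
            for degree, weight in poly.items():
                coeffs[degree] += weight * f
        # Extend every path by one transition, dropping degrees past the cutoff.
        nxt = {}
        for src, dst, weight, x_power in transitions:
            for degree, w in current.get(src, {}).items():
                d = degree + x_power
                if d <= max_degree:
                    bucket = nxt.setdefault(dst, {})
                    bucket[d] = bucket.get(d, Fraction(0)) + w * weight
        current = nxt
    return coeffs


p = Fraction(1, 3)  # hypothetical geometric parameter
coeffs = pgf_coefficients(
    initial={0: 1},
    transitions=[(0, 0, 1 - p, 1)],  # self-loop labeled (1 - p) * X
    final={0: p},
    max_steps=10,
    max_degree=10,
)
geometric_pmf = [p * (1 - p) ** n for n in range(11)]
```

Because every transition in this example raises the degree of $X$ by one, the first eleven coefficients are exact after eleven path lengths and coincide with the geometric probabilities $p(1-p)^n$.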

The language-to-world-model line uses Church as the reasoning substrate. There, utterances are translated into Church expressions such as condition, query, and define; the program then defines a distribution over possible worlds, and inference samples from that distribution conditioned on observations (Wong et al., 2023). By contrast, the program-learning line relies on a Turing-complete, higher-order probabilistic programming language, specifically Anglican, where higher-order constructs, recursion, local bindings, and eval make inference over program text feasible in principle (Perov et al., 2014).

3. Inference over plans, posteriors, and nested beliefs

A central theme is that reasoning is recast as inference over latent computational objects. In nested-agent settings, those objects are plans and beliefs about other agents’ plans. In exact-inference settings, they are posterior distributions over program variables. In test-time LLM settings, they are distributions over program neighborhoods.

"Nested Reasoning About Autonomous Agents Using Probabilistic Programs" formulates planning as inference with

$\pi(\tau) \propto p(\tau)\,\exp\big(R(\tau)\big),$

where $p(\tau)$ is a prior over trajectories and $R(\tau)$ is a reward (Seaman et al., 2018). The concrete application is a partially observable pursuit-evasion game in a polygonal map derived from Bremen, Germany, with obstacles, limited visibility, uncertain starts and goals, RRT-generated trajectories, trajectory optimization, and isovist-based visibility within a limited viewing cone. The model has four levels: an episode model, an outer chaser model, a middle runner model, and an inner naive-chaser model. At each time step, the episode model performs sequential Monte Carlo; the chaser model draws a set of candidate chaser trajectories; the runner model draws a set of candidate runner trajectories; and the runner imagines a naive chaser trajectory. The paper emphasizes nested importance sampling rather than nested Monte Carlo, and studies how different allocations of particles across the nesting levels perform under a fixed budget (Seaman et al., 2018).
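Planning as inference can be illustrated with a single level of self-normalized importance sampling: trajectories are drawn from a prior and weighted by an exponentiated reward. The sketch below is a toy under stated assumptions, not the paper's nested model. The one-dimensional random-walk prior, the reward (negative distance of the endpoint from a goal), and all names and constants are hypothetical.

```python
import math
import random


def sample_prior_trajectory(rng, steps):
    """Draw a random walk on the integers, starting at 0 (assumed prior)."""
    position, path = 0, [0]
    for _ in range(steps):
        position += rng.choice([-1, 1])
        path.append(position)
    return path


def reward(path, goal):
    """Assumed reward: negative distance between the endpoint and the goal."""
    return -abs(path[-1] - goal)


def plan_by_importance_sampling(rng, n_particles, steps, goal):
    """Return prior samples and normalized weights proportional to exp(reward)."""
    particles = [sample_prior_trajectory(rng, steps) for _ in range(n_particles)]
    log_weights = [reward(path, goal) for path in particles]
    shift = max(log_weights)  # subtract the max for numerical stability
    weights = [math.exp(lw - shift) for lw in log_weights]
    total = sum(weights)
    return particles, [w / total for w in weights]


rng = random.Random(0)
particles, weights = plan_by_importance_sampling(rng, n_particles=2000, steps=10, goal=6)

prior_mean_end = sum(path[-1] for path in particles) / len(particles)
posterior_mean_end = sum(w * path[-1] for path, w in zip(particles, weights))
# A plan is one trajectory resampled in proportion to its weight.
chosen_plan = rng.choices(particles, weights=weights, k=1)[0]
```

The weighted endpoint mean moves from roughly zero under the prior toward the goal. In a nested setting, the reward itself would be computed by running another such sampler for the opposing agent.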

The experimental results quantify the effect of belief nesting. Detection rates are reported for four scenarios: smart chaser vs naive runner, smart chaser vs smarter runner, smartest chaser vs naive runner, and smartest chaser vs smarter runner (Seaman et al., 2018). The reported interpretation is that deeper reasoning helps both evasion and detection, depending on which side has the stronger model.

Exact posterior inference is treated differently in the automata-theoretic framework. There, a program denotes a function from distributions to distributions, a PGA denotes a distribution as a PGF, and a statement is compiled into an automata operation such as product, concatenation, disjoint union, transition substitution, or reweighting/normalization (Geißler et al., 15 Dec 2025). The semantics is therefore a distribution transformer implemented on automata. For assignment of a constant $c$ to a variable, the paper gives an explicit construction: a substitution operation on transition labels is applied to the input automaton, and the result is combined with the automaton for the Dirac distribution at $c$ (Geißler et al., 15 Dec 2025). The goldfish-piranha example illustrates conditioning and normalization: the normalized posterior automaton yields probability $2/3$ that the bowl originally contained a piranha.
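The conditioning-then-normalization pattern in the goldfish-piranha example can be reproduced by exact enumeration. The sketch below assumes the classic formulation of the puzzle (the bowl initially holds a goldfish or a piranha with equal probability, a piranha is added, one fish is drawn uniformly at random, and the drawn fish is observed to be a piranha). It enumerates the joint distribution directly instead of building automata, so it illustrates the semantics, not the paper's algorithm.

```python
from fractions import Fraction


def posterior_original_piranha():
    """Exact posterior that the bowl originally held a piranha."""
    joint = {}
    for original in ("goldfish", "piranha"):
        prior = Fraction(1, 2)          # assumed uniform prior
        bowl = [original, "piranha"]    # a piranha is added to the bowl
        for drawn in bowl:              # one fish is drawn uniformly
            key = (original, drawn)
            joint[key] = joint.get(key, Fraction(0)) + prior * Fraction(1, len(bowl))

    # Conditioning keeps only the worlds consistent with the observation.
    unnormalized = {k: v for k, v in joint.items() if k[1] == "piranha"}
    evidence = sum(unnormalized.values())  # normalization constant
    hit = sum(v for (orig, _), v in unnormalized.items() if orig == "piranha")
    return hit / evidence, evidence


posterior, evidence = posterior_original_piranha()
```

Under these assumptions the unnormalized mass after conditioning is $3/4$, and dividing by it gives a posterior of $2/3$, which mirrors the separation between computing an unnormalized posterior and normalizing it.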

These two lines—nested simulation and exact automata-based inference—address different regimes. One targets online reasoning in quasi-realistic continuous environments with Monte Carlo estimators; the other targets exact inference for a restricted discrete imperative fragment. A plausible implication is that probabilistic programs of thought are not tied to a single inference algorithm, but to the more general idea that reasoning proceeds by operating on a structured probabilistic representation.

4. Language-informed world modeling and meaning construction

The language-to-PLoT framework gives the clearest account of probabilistic programs as a medium of thought rather than merely a vehicle for statistical modeling. "From Word Models to World Models" proposes rational meaning construction: a context-sensitive translation from natural language into a probabilistic language of thought, followed by inference over that representation (Wong et al., 2023). The architecture separates meaning construction from reasoning. A code-trained LLM, specifically Codex, translates utterances into Church code, while the probabilistic program handles belief updating and query answering.

The operational motif is a generative world model plus evidence or conditioning plus a query. The LLM prompt contains the generative world model in Church, several language-code pairs, and a new utterance; the completion produces Church code using constructs such as condition, query, or define (Wong et al., 2023). The translation is context-sensitive both to discourse context and to the current program context, which allows domain-specific functions such as won-against, children-of, filter-color, or restaurant_utility to be selected appropriately.
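The motif of world model, condition, and query can be sketched outside Church with plain rejection sampling. The example below is a Python analogue of a tug-of-war model, written for illustration only: the Gaussian strength prior, the laziness probability, the halving of strength when lazy, and every function name are assumptions, not the paper's code.

```python
import random


def sample_world(rng, players):
    """Generative world model: each player gets a latent strength."""
    return {name: rng.gauss(50, 20) for name in players}


def effective_strength(rng, world, name, lazy_prob=0.3):
    """A lazy player pulls with half strength in a given match (assumed)."""
    strength = world[name]
    return strength / 2 if rng.random() < lazy_prob else strength


def won_against(rng, world, a, b):
    """Stochastic match outcome between two single-player teams."""
    return effective_strength(rng, world, a) > effective_strength(rng, world, b)


def query_is_strong(rng, n_samples):
    """Estimate P(alice is stronger than average | alice won against bob)."""
    accepted = strong = 0
    for _ in range(n_samples):
        world = sample_world(rng, ["alice", "bob"])
        # condition: discard worlds that contradict the observation
        if not won_against(rng, world, "alice", "bob"):
            continue
        accepted += 1
        # query: evaluate the question in each surviving world
        strong += world["alice"] > 50
    return strong / accepted


posterior = query_is_strong(random.Random(0), 5000)
```

In the architecture described above, the LLM's job would be to emit the equivalent of the `condition` and `query` lines from an utterance such as "Alice beat Bob; is Alice strong?", while the sampler performs the inference.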

The paper demonstrates this architecture in four domains. In probabilistic reasoning, a Bayesian tug-of-war model assigns latent strength and laziness to players and supports inferences such as whether a player is probably strong or whether a future match is likely to be won. In logical and relational reasoning, a stochastic family-tree generator supports inference over kinship under uncertainty, with predicates such as father-of?, mother-of?, and grandparent-of?. In visual and physical reasoning, a tabletop scene model supports object-counting and comparative quantification, while a physics extension introduces mass, force, friction, collisions, and event predicates such as is_moving, is_resting, and is_hitting. In social reasoning, a restaurant-navigation gridworld with utilities and motion costs supports inverse planning and goal inference (Wong et al., 2023).

A distinctive feature is symbolic module integration. The PLoT can call Blender for rendering, a physics simulator for Newtonian dynamics, and a model-based planner implemented with value iteration (Wong et al., 2023). These modules are treated as subroutines inside the probabilistic program, not as separate systems. The same framework also allows world-model growth through define, including new lexical concepts and even construction of an entire tug-of-war world model from a textual description.

This line explicitly positions probabilistic programs as a unifying substrate for uncertainty, compositional structure, and simulation-based commonsense reasoning. It also rejects a strict equivalence between language and reasoning: language supplies a translation into code, but the probabilistic program carries out the inference.

5. Learning program text and using posteriors as supervision

Another major strand treats probabilistic programs not only as inference substrates but as learnable objects and supervision generators.

"Learning Probabilistic Programs" asks whether one can infer probabilistic program text from data (Perov et al., 2014). The target object is a sampler program: executable code that, when run repeatedly, generates samples matching the observed distribution. The learning setup introduces observed data $I$ 4, generated data $I$ 5, latent program text $I$ 6, summary functions, and compatibility terms, and then uses MCMC-ABC-style inference over candidate programs. The paper works in Anglican and relies especially on Metropolis-Hastings and PMCMC. In the authors’ framing, MCMC is both approximate Bayesian inference and a form of stochastic search over program space. The hierarchical prior over program text uses typed production rules, including variable selection, constants, primitive and stochastic procedures, compound procedures from a CRP/Dirichlet-process prior, let, if, and recursion. Production-rule probabilities are learned from a corpus of human-written samplers such as Box-Muller and Knuth’s Poisson algorithm (Perov et al., 2014).

The empirical program-learning results include rediscovery experiments for Bernoulli, Poisson, Gamma, Beta, and two Normal variants, as well as approximation of some credit-approval feature distributions. The paper also reports the text of an inferred Bernoulli sampler. It further introduces a broader notion of probabilistic program compilation: automatically generating program text that directly samples from a posterior distribution produced by an original program. The Beta-Binomial example uses posterior samples from a model whose exact posterior is a Beta distribution and learns sampler code whose output matches that posterior (Perov et al., 2014).
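The idea behind compilation in the Beta-Binomial case can be shown by comparing two samplers for the same posterior: one that conditions the original model by rejection, and one that samples the known conjugate posterior directly. The sketch below assumes a uniform Beta(1, 1) prior and hypothetical data (7 successes in 10 trials). The direct sampler is written by hand here, whereas the paper's goal is to learn such code automatically.

```python
import random


def posterior_by_rejection(rng, successes, trials, n_accept):
    """Sample theta | data by running the original model and rejecting mismatches."""
    samples = []
    while len(samples) < n_accept:
        theta = rng.random()  # Beta(1, 1) prior, assumed
        k = sum(rng.random() < theta for _ in range(trials))
        if k == successes:    # keep only runs that reproduce the data
            samples.append(theta)
    return samples


def compiled_sampler(rng, successes, trials):
    """Direct sampler for the conjugate posterior Beta(1 + k, 1 + n - k)."""
    return rng.betavariate(1 + successes, 1 + trials - successes)


rng = random.Random(1)
k, n = 7, 10  # hypothetical observations
slow = posterior_by_rejection(rng, k, n, n_accept=2000)
fast = [compiled_sampler(rng, k, n) for _ in range(2000)]

exact_mean = (1 + k) / (2 + n)  # mean of Beta(8, 4)
slow_mean = sum(slow) / len(slow)
fast_mean = sum(fast) / len(fast)
```

Both samplers target the same distribution, but the direct one needs no conditioning loop, which is the practical payoff of replacing inference with generated sampler code.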

"Using Probabilistic Programs to Train Inductive Reasoning in LLMs" shifts the role of probabilistic programs from learnable object to supervision mechanism (Zhang et al., 26 May 2026). Program-based Posterior Training proceeds in three stages. First, an LLM synthesizes open-world scenarios in domains such as sports, healthcare, and general reasoning. Second, another LLM prompt translates each scenario into a probabilistic program in Pyro encoding latent causes, observed evidence, and query variables. Third, the program is executed with probabilistic inference, typically MCMC or rejection sampling, to compute the posterior distribution over the queried quantity given the observations. That posterior becomes a soft supervision target.

The modeling setup uses latent variables $z$, observations $x$, and a query variable $y$. Written with the chain rule, the joint distribution factorizes as

$p(z, x, y) = p(z)\, p(x \mid z)\, p(y \mid z, x),$

and the posterior over the query is

$p(y \mid x) = \int p(y \mid z, x)\, p(z \mid x)\, dz.$

For training, the posterior over $y$ is binned into $K$ intervals with probabilities $q_1, \dots, q_K$, and the LLM is optimized with a soft-label cross-entropy loss over answer tokens on a fixed numeric scale (Zhang et al., 26 May 2026).
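The step from posterior samples to a soft training target can be sketched in a few lines. The example below is a minimal illustration, assuming equal-width bins over a known range and a model that emits one logit per bin. The function names, the bin range, and the sample values are hypothetical, and the paper's exact binning and tokenization may differ.

```python
import math


def bin_posterior(samples, low, high, n_bins):
    """Turn posterior samples into bin probabilities over [low, high]."""
    counts = [0] * n_bins
    width = (high - low) / n_bins
    for x in samples:
        index = int((x - low) / width)
        index = min(max(index, 0), n_bins - 1)  # clamp values on the edges
        counts[index] += 1
    total = sum(counts)
    return [c / total for c in counts]


def soft_label_cross_entropy(target_probs, logits):
    """Cross-entropy between a soft target and softmax(logits)."""
    shift = max(logits)
    log_z = shift + math.log(sum(math.exp(l - shift) for l in logits))
    return -sum(q * (l - log_z) for q, l in zip(target_probs, logits))


posterior_samples = [0.1, 0.2, 0.25, 0.7]  # stand-in for MCMC output
target = bin_posterior(posterior_samples, low=0.0, high=1.0, n_bins=4)

uniform_loss = soft_label_cross_entropy(target, [0.0, 0.0, 0.0, 0.0])
# Logits that put more mass on the first bin fit this target better.
peaked_loss = soft_label_cross_entropy(target, [1.0, 0.3, 0.3, -2.0])
```

Unlike a one-hot label, the soft target rewards the model for spreading probability across bins in the same proportions as the inferred posterior, which is what ties the loss to calibration.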

The dataset includes 14,912 scenarios and 59,633 queries overall, while the abstract emphasizes over 10,000 unique open-world scenarios and over 50,000 queries. Training uses LoRA fine-tuning on Llama-3-8B-Instruct and Qwen-2-7B-Instruct with a single A100 GPU, AdamW, batch size 2, LoRA rank 8, and tuned learning rates (Zhang et al., 26 May 2026). The reported results show improved mean absolute error, alignment with human judgments, and calibration. On human judgments in sports scenarios, the paper compares the overall score of the fine-tuned model with those of the base LLM and the original MSA probabilistic-program outputs. On OpenEstimate, Llama’s MAE improves with PPT distributional training. The paper further reports gains in NLL and ECE on OpenEstimate, Bayesian Teaching, MMLU, TruthfulQA, HellaSwag, ARC-Challenge, and Winogrande (Zhang et al., 26 May 2026).

Taken together, these papers show two complementary uses of program-based uncertainty. One learns stochastic procedures as hypotheses. The other uses formally inferred posteriors from probabilistic programs to train LLMs toward calibrated inductive inference.

6. Test-time probabilistic programs, efficiency claims, and open limitations

The 2026 paper titled "Probabilistic Programs of Thought" uses the phrase in a narrower test-time sense: one LLM-generated program is turned into a compact probabilistic program that represents many nearby deterministic programs (Garg et al., 19 Apr 2026). The setting is the standard sample-execute-verify workflow for code generation, mathematical reasoning, and best-of-$N$ decoding. The paper argues that repeated GPU generations are wasteful because the next-token probabilities attached to a sampled program already encode useful uncertainty about local alternatives.

The formalization begins with an autoregressive token process, then parsing and execution. The key construction selects a set of token positions and turns those tokens into random variables whose distributions are derived from the model’s next-token probabilities. The resulting compiled object factors into tokens kept from the original LLM sample, tokens turned into random variables, and deterministic parse and execute constraints (Garg et al., 19 Apr 2026). The practical implementation deliberately restricts random variables to single-token program components (digits, comparison operators, arithmetic operators, and assignment operators) because computing marginals for multi-token components is stated to be computationally hard.

The runtime argument is straightforward: with $n$ LLM calls, the method obtains $n$ base programs and then draws $k$ additional samples from each compiled probabilistic program, yielding $n(k+1)$ candidate programs in total (the $n$ originals plus $nk$ resampled variants) while keeping the number of expensive GPU generations at $n$ (Garg et al., 19 Apr 2026). Compilation and sampling are CPU-side, and the runtime plots are reported to show that wall-clock time is dominated by the original LLM inference. The paper also provides a soundness theorem under an independence assumption: each resampled token must be independent of succeeding tokens given the preceding tokens. Under that assumption, the empirical distribution of samples from the compiled procedure converges almost surely to the target code distribution.
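The sample-compile-resample-verify loop can be sketched on a toy program format. The example below is an illustration under strong assumptions, not the paper's system: programs are three tokens of the form integer, operator, integer; the per-position alternative distributions stand in for next-token probabilities that a real LLM would supply; and the verifier simply checks a known target value. All names and numbers are hypothetical.

```python
import operator
import random

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def compile_program(tokens, alternatives):
    """Point mass for kept tokens, a categorical for positions made random."""
    return [alternatives.get(i, {tok: 1.0}) for i, tok in enumerate(tokens)]


def sample_program(rng, compiled):
    """Draw each token independently from its per-position distribution."""
    return [
        rng.choices(list(dist), weights=list(dist.values()), k=1)[0]
        for dist in compiled
    ]


def execute(tokens):
    """Toy executor for programs of the form: <int> <op> <int>."""
    left, op, right = tokens
    return OPS[op](int(left), int(right))


# Stand-ins for n = 2 programs sampled from an LLM.
base_programs = [["12", "+", "3"], ["12", "-", "3"]]
# Hypothetical next-token probabilities at the operator position (index 1).
alternatives = [
    {1: {"+": 0.5, "*": 0.4, "-": 0.1}},
    {1: {"-": 0.6, "*": 0.3, "+": 0.1}},
]

k = 5  # cheap CPU-side samples per base program
rng = random.Random(0)
candidates = []
for tokens, alts in zip(base_programs, alternatives):
    candidates.append(tokens)  # keep the original LLM sample
    compiled = compile_program(tokens, alts)
    candidates.extend(sample_program(rng, compiled) for _ in range(k))

# Verification step: keep candidates whose execution matches the target.
verified = [c for c in candidates if execute(c) == 36]
```

Two expensive generations produce twelve candidates here. Neither base program computes 36, so only a resampled variant that picks the multiplication operator can pass the verifier, which is the kind of near-miss the method is designed to recover.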

The benchmarks are GSM8K, Plot2Code, and CRUXEval-like structured output tasks, and the paper reports gains on each, with the GSM8K gain depending on model size (Garg et al., 19 Apr 2026). The paper also states that PPoT can achieve the performance of 20 LLM samples using only about 8 LLM calls, that in one reported setting even one PPoT sample per LLM sample can be worth around 8 additional LLM samples, and that 20 PPoT samples can be worth around 39 additional LLM samples.

Across the broader literature, limitations are explicit and significant. The automata-theoretic exact-inference framework is restricted to discrete probabilistic programs with assignments, sampling from predefined discrete distributions, conditioning, conditional branching, and rectangular guards in if statements; Boolean tests compare variables only to constants, while extensions to non-rectangular guards or while loops are future work. The prototype automatically computes the unnormalized posterior, and normalization is an additional step; moreover, a complete characterization of the class of distributions representable by PGAs remains open (Geißler et al., 15 Dec 2025). Program-based Posterior Training depends on the quality of LLM-synthesized Pyro programs, and the method is evaluated on numeric estimation and multiple-choice queries rather than fully open-ended inductive reasoning (Zhang et al., 26 May 2026). Test-time PPoT only explores a local neighborhood of the original sample, assumes approximate token independence for its soundness claim, produces samples that are not fully independent, and relies on a verifier or scoring function for best-of-$N$ selection (Garg et al., 19 Apr 2026).

A common misconception is that these approaches all target the same problem. They do not. Some address exact inference in restricted languages, some model theory of mind in online planning, some translate language into executable world models, some infer program text from data, some train LLMs from posterior labels, and some resample code neighborhoods at test time. What unifies them is narrower and more precise: probabilistic computation is treated as the internal representational medium in which structured uncertainty can be expressed, transformed, and queried.