Probabilistic Programs of Thought

Published 19 Apr 2026 in cs.CL, cs.AI, and cs.PL | (2604.17290v1)

Abstract: LLMs are widely used for code generation and mathematical reasoning tasks where they are required to generate structured output. They either need to reason about code, generate code for a given specification, or reason using programs of thought. The typical approach to code generation is to prompt the model and generate samples until an appropriate program is obtained. Within this process, sampling $n$ programs from the LLM requires $n$ GPU compute-intensive generations which becomes prohibitively expensive for larger values of $n$. In this work, we address this limitation by exposing the LLM's distribution within the generated programs themselves. We propose a novel test-time framework we dub probabilistic programs of thought to obtain more samples from the model with fewer LLM generations. Given a program generated by a model and the associated next-token probabilities, we build a probabilistic program that compactly represents exponentially many deterministic programs. Since performing probabilistic reasoning in this probabilistic program is much cheaper, our approach allows sampling new programs without any additional GPU compute and little CPU overhead. We instantiate our approach on benchmarks for code generation, code understanding and mathematical reasoning and report improvements in performance with fewer generations from the LLM.


Authors (6)

Summary

The paper introduces a probabilistic framework that reifies token uncertainty into discrete random variables, enabling efficient sampling of candidate solutions.
It demonstrates significant improvements on tasks like GSM8K and code generation by producing multiple samples with no extra GPU cost.
The approach overcomes traditional LLM sampling bottlenecks, paving the way for scalable structured reasoning and advanced program synthesis.

Probabilistic Programs of Thought: A Technical Essay

Problem Setting and Motivation

The prevailing paradigm for leveraging LLMs in structured reasoning and code generation involves sequential autoregressive inference: generating whole programs token-by-token, then executing and checking them for correctness. This “sample-execute-verify” pipeline is widely used for code generation, mathematical reasoning, and related tasks, where multiple diverse samples are required to achieve strong pass@$k$ rates. However, generation from LLMs is throughput-constrained: each sample requires a full autoregressive decode on the GPU, with a forward pass for every generated token, so standard best-of-$n$ or beam approaches are bottlenecked by LLM sampling cost. This bottleneck limits practical scalability, and the inability to draw additional samples cheaply from the LLM’s program distribution restricts the quality and diversity of generated solutions.

This work introduces an alternative solution: treat LLM generations not as fully determined code strings, but as stochastic "probabilistic programs," which expose structured latent distributions over many possible programs. By compactly encoding the local uncertainty in the LLM's next-token distributions at key locations in a single program trace, the proposal enables cheap further sampling from the LLM’s implicit program distribution without additional costly GPU forward passes.

Methodological Framework

The essential insight is to exploit the local uncertainty—captured in the model’s next-token distribution at each program token—during a single LLM invocation. Rather than discarding the predictive distributions after sampling, the framework synthesizes a "probabilistic program" by treating specified program tokens (e.g., constants, operators, subtrees) as random variables parameterized by the LLM's logits at those positions. In other words, each generated code solution is post-processed into a compact probabilistic program, where certain tokens are replaced by discrete random variables whose distributions are prescribed by the conditional LLM next-token probabilities.
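As a minimal sketch of how one position's logits could become a discrete random variable, the snippet below applies a softmax over only the tokens allowed at that position. The function name `restricted_categorical`, the dictionary format for logits, and the toy values are assumptions made for illustration; they are not taken from the paper.

```python
import math

def restricted_categorical(logits, allowed):
    """Softmax over the allowed tokens only.

    logits:  dict mapping token string -> raw logit (hypothetical format).
    allowed: set of tokens that may appear at this position.
    Returns a dict mapping token -> probability that sums to 1.
    """
    kept = {tok: z for tok, z in logits.items() if tok in allowed}
    if not kept:
        raise ValueError("no allowed token has a logit at this position")
    top = max(kept.values())  # subtract the max for numerical stability
    weights = {tok: math.exp(z - top) for tok, z in kept.items()}
    total = sum(weights.values())
    return {tok: w / total for tok, w in weights.items()}

# Toy position where the seed program contains a digit.
toy_logits = {"3": 2.0, "5": 2.0, "foo": 5.0}
digit_dist = restricted_categorical(toy_logits, allowed={"3", "5", "7"})
print(digit_dist)
```

Restricting before normalizing gives the same result as renormalizing the full softmax over the allowed set, so the high-logit but invalid token `"foo"` takes no probability mass here.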

Formally, given an LLM $\mathcal{M}$ and a prompt $\bm{t}$, the canonical autoregressive generation process admits a probabilistic program interpretation:

$X_1 \sim \mathrm{Cat}(P_\mathcal{M}(\cdot\mid \bm{t})), \quad X_2 \sim \mathrm{Cat}(P_\mathcal{M}(\cdot\mid X_1,\bm{t})), \ldots$

with the generated output parsed and executed as a code artifact. The key contribution is to "reify" a subset $\mathbf{L}$ of these variables (e.g., all occurrences of digits, operator tokens, or specific positions) as random variables, leaving the remainder fixed. The result is a hybrid deterministic/probabilistic program that encodes an exponentially large local family of programs around the seed solution.
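One way to write the hybrid program down is sketched here. The symbols $x_{1:T}$ (the seed trace), $\tilde{X}_i$ (the token at position $i$ in the hybrid program), and $S_i$ (the support kept at position $i$) are notation introduced for this sketch, not necessarily the paper's.

```latex
\tilde{X}_i \sim
\begin{cases}
  \mathrm{Cat}\big(P_\mathcal{M}(\cdot \mid x_{<i}, \bm{t})\big) & \text{if } i \in \mathbf{L}, \\
  \delta_{x_i} & \text{if } i \notin \mathbf{L},
\end{cases}
\qquad
\#\{\text{distinct programs}\} \le \prod_{i \in \mathbf{L}} |S_i| .
```

With two candidates at each reified position the bound is $2^{|\mathbf{L}|}$, so ten reified positions already describe up to 1024 deterministic programs.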

Inference is then performed by efficient direct sampling from the resulting probabilistic program (which is tractable because the variables are typically independent and have small support), yielding new candidate samples without further LLM queries. Algorithmically, for each LLM-generated program, the framework identifies a suitable set of tokens to encode as random variables, replaces them with categorical distributions parameterized by the LLM next-token logits (restricted to the syntactically valid support for the program construct), and supports direct, conditional, or rejection-based sampling from the probabilistic program.
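The steps described above can be sketched in a few lines of Python. This is an illustrative toy, not the authors' implementation: the `Choice` class, the `reify`, `sample`, and `rejection_sample` helpers, the `same_kind` validity rule, and the per-position probability dictionaries are all hypothetical.

```python
import random
from dataclasses import dataclass

@dataclass
class Choice:
    """A reified token: candidate strings with renormalized probabilities."""
    options: list
    probs: list

def reify(tokens, next_token_probs, is_valid):
    """Replace uncertain tokens of the seed program with Choice nodes.

    next_token_probs[i] maps candidate token -> probability at position i
    (assumed format). is_valid(seed, cand) says whether cand may replace seed.
    """
    program = []
    for tok, dist in zip(tokens, next_token_probs):
        support = {c: p for c, p in dist.items() if is_valid(tok, c)}
        if len(support) > 1:
            total = sum(support.values())
            program.append(Choice(list(support), [p / total for p in support.values()]))
        else:
            program.append(tok)  # nothing to vary: keep the seed token fixed
    return program

def sample(program, rng):
    """Direct sampling: draw every Choice independently, no LLM call needed."""
    return "".join(
        rng.choices(n.options, weights=n.probs, k=1)[0] if isinstance(n, Choice) else n
        for n in program
    )

def rejection_sample(program, accept, rng, max_tries=1000):
    """Resample until accept(candidate) holds; None if the budget runs out."""
    for _ in range(max_tries):
        cand = sample(program, rng)
        if accept(cand):
            return cand
    return None

OPS = {" + ", " - ", " * "}

def same_kind(seed, cand):
    # Toy validity rule: digits replace digits, operators replace operators.
    return (seed.isdigit() and cand.isdigit()) or (seed in OPS and cand in OPS)

tokens = ["ans", " = ", "3", " * ", "4"]
probs = [{"ans": 1.0}, {" = ": 1.0}, {"3": 0.6, "5": 0.3, "x": 0.1},
         {" * ": 0.7, " + ": 0.3}, {"4": 1.0}]
program = reify(tokens, probs, same_kind)
fixed = rejection_sample(program, lambda s: eval(s.split(" = ", 1)[1]) == 9,
                         random.Random(0))
print(sample(program, random.Random(0)), "|", fixed)
```

In this toy the seed `ans = 3 * 4` expands into four candidate programs, and the rejection sampler recovers the one whose right-hand side evaluates to 9 using CPU-only work.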

A crucial theoretical point concerns distributional faithfulness: since future LLM tokens may not be conditionally independent of local modifications, samples from a probabilistic program of thought are only distributed according to the true LLM program distribution under a specific independence assumption (namely, post-trace independence analogous to pseudo-likelihood). The authors formalize this, demonstrating that for suitably chosen token sets, empirical distributions of samples from the probabilistic program converge to the original code distribution under this assumption.
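One way to make the independence assumption concrete is sketched below. The notation ($x$ for the seed trace of length $T$, $x'$ for a resampled program, $Q$ for the sampler's distribution) is introduced here for illustration and is not necessarily the paper's own formalization.

```latex
% Sampler: reified positions are drawn independently, conditioned on the SEED prefix.
Q(x' \mid x, \bm{t}) = \prod_{i \in \mathbf{L}} P_\mathcal{M}(x'_i \mid x_{<i}, \bm{t}),
\qquad x'_i = x_i \ \text{ for } i \notin \mathbf{L}.

% LLM: every position is conditioned on the MODIFIED prefix.
P_\mathcal{M}(x' \mid \bm{t}) = \prod_{i=1}^{T} P_\mathcal{M}(x'_i \mid x'_{<i}, \bm{t}).

% Assumption: changing earlier reified tokens leaves later conditionals unchanged.
P_\mathcal{M}(x'_i \mid x'_{<i}, \bm{t}) = P_\mathcal{M}(x'_i \mid x_{<i}, \bm{t})
\quad \text{for all } i.
```

Under that assumption the factors at positions outside $\mathbf{L}$ do not depend on the resampled tokens, so $Q$ coincides with the LLM's conditional distribution over the reified tokens given the fixed ones. When the assumption fails, $Q$ is only an approximation to that conditional.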

Empirical Results: Computational and Statistical Efficiency

The empirical evaluation is extensive and benchmarks the framework against standard test-time sampling approaches on code and reasoning tasks: GSM8K (math), Plot2Code (code generation from visual input), and CRUXEval (program inversion). The core protocol is to fix a compute budget of $k$ LLM generations and, for each, generate $m$ samples from the resulting probabilistic program, effectively producing $k(1+m)$ candidate solutions at the cost of only $k$ LLM queries.
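The candidate count in this protocol is simple arithmetic, shown here as a small helper. The function name and the example budget values are illustrative assumptions, not figures from the paper.

```python
def candidate_count(k: int, m: int) -> int:
    """Candidates available after k LLM generations, each expanded with m
    samples from its probabilistic program (the seed itself counts once)."""
    if k < 0 or m < 0:
        raise ValueError("k and m must be non-negative")
    return k * (1 + m)

# Hypothetical budgets: the LLM cost stays at k generations in every row.
for k, m in [(4, 0), (4, 10), (8, 10)]:
    print(f"k={k}, m={m}: {candidate_count(k, m)} candidates for {k} LLM generations")
```

Setting `m = 0` recovers plain LLM-only sampling, which makes the comparison at a fixed number of LLM generations explicit.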

Strong quantitative improvements are observed. On GSM8K, for instance, drawing additional samples from the probabilistic program for each LLM generation yields a 2–7% increase in accuracy over LLM-only sampling, with zero additional LLM runtime cost. On code inversion and code generation tasks, the framework achieves up to 7% accuracy or match-score gains, outperforming beam search and best-of-$n$ with the same or reduced effective sampling cost.

Runtime analyses confirm that the framework introduces negligible CPU overhead and no additional GPU utilization. A separate scaling-law analysis shows that it not only shifts accuracy upward but also increases the negative power-law exponent, meaning its marginal utility grows with the number of samples. At large sample budgets, samples from the probabilistic program are empirically worth tens of additional LLM samples, delivering strong computational leverage.

Qualitative analysis shows that the framework can correct local errors efficiently, sample diverse functions within the feasible program family, and recover valid solutions that the LLM fails to output in its deterministic or beam search modes.

Theoretical and Practical Implications

From a theoretical standpoint, this work extends the interface between generative LLMs and probabilistic programming languages (PPLs). It operationalizes the “probabilistic program as distributional trace” perspective for LLMs, introducing a principled framework for local ambiguity exploitation. The approach is generic: it can be deployed for any LLM-based structured generation task and is agnostic to the backbone model and tokenizer, provided next-token probabilities are exposed.

The principal limitation—conditional independence of the reified tokens—can, in practice, be ameliorated by careful choice of token sets, restricting to low-memorization positions (e.g., digits, single tokens for operators/constants). However, modeling more complex factors (e.g., multi-token spans, identifiers, subtrees) involves exponential complexity in marginalization, and addressing this remains an open computational challenge.
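A rough count illustrates why joint modeling of multi-token spans is costly. The sketch assumes each position keeps a fixed number of candidate tokens; the helper names and the numbers are hypothetical and only show the growth rate.

```python
import math

def independent_table_size(support_sizes):
    """Probabilities stored when each position is its own random variable."""
    return sum(support_sizes)

def joint_table_size(support_sizes):
    """Entries in a full joint table over the whole span (one per combination)."""
    return math.prod(support_sizes)

# A three-token span with five candidates per position.
span = [5, 5, 5]
print(independent_table_size(span), joint_table_size(span))

# The joint table grows exponentially with span length; the independent one linearly.
for length in (1, 2, 4, 8):
    sizes = [5] * length
    print(length, independent_table_size(sizes), joint_table_size(sizes))
```

Treating positions independently keeps storage and sampling linear in the span length, which is consistent with restricting reification to single tokens as described above.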

On the applied side, this method raises test-time sampling throughput for program synthesis, reasoning, code verification, and any setting requiring high-diversity program output. Because it addresses the sampling bottleneck orthogonally to other techniques, it enables broader search within local program neighborhoods using probabilistic inference. In addition, the framework is compatible with and complementary to other inference-time strategies such as reranking, MCMC, and SMC-based constrained decoding (Lew et al., 2023).

Future Research Directions

Future work may address several extension axes:

Rich Program Slicing: Moving beyond single-token random variables to more expressive schemata (identifiers, subtree rewrites, control flow).
Correlation Modeling: Incorporating sequence models over resampled tokens to relax the independence assumption.
Probabilistic Constraint Imposition: Enabling complex conditional sampling, rejection, or guided inference by composing program constraints.
Expectation Computation: Leveraging probabilistic program structure for marginal, evidence, or expected value queries w.r.t. LLM's local distribution.
Hybrid Techniques: Integrating the framework with MCMC or SMC sampling for global exploration, or coupling it with non-parametric rerankers/scorers.
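To give a sense of what an expectation query over a probabilistic program might look like, the sketch below enumerates every combination of a few independent, small-support variables and accumulates the distribution over executed outputs. It is a toy under stated assumptions: `output_distribution`, the slot format, and the example program are hypothetical, and exact enumeration is only practical when the number of combinations is small.

```python
from collections import defaultdict
from itertools import product

def output_distribution(slots, run):
    """Exact distribution over program outputs.

    slots: list of dicts, one per reified position, mapping value -> probability
           (positions are assumed independent).
    run:   function mapping a tuple of chosen values to the program's output.
    """
    dist = defaultdict(float)
    for combo in product(*(slot.items() for slot in slots)):
        values = tuple(value for value, _ in combo)
        prob = 1.0
        for _, p in combo:
            prob *= p
        dist[run(values)] += prob
    return dict(dist)

# Toy program "<digit> <op> 4" with an uncertain digit and operator.
slots = [{3: 0.75, 5: 0.25}, {"*": 0.5, "+": 0.5}]
run = lambda v: v[0] * 4 if v[1] == "*" else v[0] + 4

dist = output_distribution(slots, run)
expected = sum(out * p for out, p in dist.items())
print(dist, expected)
```

The same table also gives marginal queries directly, for example the probability that the output equals a target value.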

Given these directions, the framework opens a promising research path for tractably exposing the internal stochasticity of LLMs in the service of efficient structured generation and reasoning.

Conclusion

Probabilistic programs of thought constitute a compelling approach for structured output generation from LLMs. By directly leveraging next-token distributions to build rich probabilistic program abstractions around each generation, the method breaks the traditional LLM sampling bottleneck, significantly improving sample efficiency and coverage at test time. It adds a rigorous probabilistic programming layer atop LLMs, yielding practical performance gains and suggesting a broad research agenda at the interface of generative AI and program synthesis (2604.17290).