Papers
Topics
Authors
Recent
Search
2000 character limit reached

SMC Transformer Steering

Updated 21 April 2026
  • SMC Transformer Steering is a method that formulates constrained text generation as posterior inference in discrete probabilistic sequence models.
  • The approach utilizes a population of weighted particles with resampling and importance weighting to ensure global constraint satisfaction while balancing sampling and optimization.
  • It integrates with the LLaMPPL library and realizes comparable computational cost to beam search, achieving high validity and output diversity in practical LLM applications.

Sequential Monte Carlo (SMC) Transformer Steering is an inference-time method for enforcing complex syntactic or semantic constraints during the generation phase of LLMs. It frames constrained text generation as posterior inference in discrete probabilistic sequence models—termed Feynman–Kac Transformer models—and applies SMC to approximate the required posterior with a weighted population of particles. The approach achieves global constraint satisfaction, supports both sampling and optimization under constraints, and operates at computational cost comparable to strong LLM decoders such as beam search. SMC Transformer Steering is implemented in the LLaMPPL probabilistic programming library, designed for flexible integration with LLaMA-family Transformers (Lew et al., 2023).

1. Motivation and Conceptual Framework

LLMs such as GPT and LLaMA are highly effective at predicting the next token in open-ended settings, but fine-grained control over output—such as guaranteeing hard structural constraints, solving complex infill tasks, or simultaneously matching multiple prompts—remains challenging. Prompt engineering and plug-ins for greedy token masking often become trapped in local optima, while beam search—an optimization-based method—tends to collapse output diversity and is ill-suited for sampling under global constraints. Markov Chain Monte Carlo (MCMC) methods, though general, frequently suffer from slow mixing, especially in high-dimensional discrete spaces.

SMC steering recasts constrained generation as exact posterior inference. Rather than conditionally restricting candidates at each decoding step, it maintains a population of weighted sequences ("particles") that evolve according to a Markov sequence model with potentials (likelihood functions) encoding problem-specific constraints. The posterior over complete sequences is thereby approximated with a controlled diversity of candidate solutions, allowing for both hard constraints and a balance between sampling and optimization (Lew et al., 2023).

2. Probabilistic Model and SMC Inference

A Feynman–Kac Transformer model is constructed as follows. Let fθf_\theta denote the frozen LLM, which emits logits for next-token predictions. For generation, given an initial state s0s_0 (such as a prompt xx), the model employs:

  • Markov kernels Mt(stst1,fθ)M_t(s_t | s_{t-1}, f_\theta) that propose next tokens,
  • A potential Gt(st1,st,fθ)0G_t(s_{t-1}, s_t, f_\theta) \geq 0 that scores transitions according to whether constraints are satisfied.

The unnormalized joint over the sequence path s1:s_{1:\infty} is:

p(s1:)=M1(s1s0)G1(s0,s1)M2(s2s1)G2(s1,s2)p(s_{1:\infty}) = M_1(s_1 | s_0) G_1(s_0, s_1) \cdot M_2(s_2 | s_1) G_2(s_1, s_2) \cdots

The model's posterior over completions terminating with EOS is then

P(s)t=1TMt(stst1)Gt(st1,st)P(s) \propto \prod_{t=1}^T M_t(s_t | s_{t-1}) G_t(s_{t-1}, s_t)

When the proposal kernel MtM_t is distinct from the LLM's unconstrained distribution pθp_\theta, importance weights are constructed as:

s0s_00

The population is periodically resampled when the effective sample size s0s_01 falls below a threshold fraction (typically s0s_02) of s0s_03 (the number of particles).

3. SMC Steering Algorithm and Implementation

SMC steering in LLaMPPL advances as follows (specialized to the s0s_04 importance weight view):

  1. Initialization: Start s0s_05 particles at prompt s0s_06, each with weight 1.
  2. Particle Propagation: For each active (non-EOS) particle s0s_07, clone s0s_08 times. Each clone s0s_09 proposes xx0 and updates its weight by xx1.
  3. Weight Normalization and Resampling: Normalize all weights. If xx2, resample without replacement using stratified sampling to maintain unbiased estimates of the partition function.
  4. Advancement: The resampled set of xx3 forms the next population.
  5. Output: After all particles reach EOS, the collection xx4 approximates the target posterior; the estimated evidence is xx5.

Task specification in LLaMPPL is accomplished by defining a subclassed Model with a step() method incorporating calls to:

  • self.transformer(context),
  • self.sample(dist, proposal=...),
  • self.condition(flag),
  • self.observe(dist, val),
  • and self.finish().

This approach supports declarative specification of constraints and objectives, as xx6 and xx7 are constructed from the probabilistic program logic.

4. Integration with Transformer Decoders

SMC Transformer Steering interfaces efficiently with LLMs by sharing a CachedTransformer object, which stores a trie of all explored token prefixes. Each query for next-token logits on a prefix checks the cache: if previously computed, cached key/value activations and logits are returned in xx8 time; otherwise, only the novel suffix is processed. This cross-particle and temporal caching amortizes inference cost and reduces unique full-model calls from nominal xx9 to Mt(stst1,fθ)M_t(s_t | s_{t-1}, f_\theta)0, closely matching beam search in computational cost.

Constraints are integrated via the Mt(stst1,fθ)M_t(s_t | s_{t-1}, f_\theta)1 potentials. Hard constraints are imposed by assigning zero weight to violating transitions, guaranteeing that particle resampling prunes infeasible paths.

5. Computational Complexity and Empirical Performance

Let Mt(stst1,fθ)M_t(s_t | s_{t-1}, f_\theta)2 be the average completion length, Mt(stst1,fθ)M_t(s_t | s_{t-1}, f_\theta)3 the particle count, and Mt(stst1,fθ)M_t(s_t | s_{t-1}, f_\theta)4 the expansion factor. The computational and memory cost of SMC steering is summarized as:

Method Model Calls Memory Footprint
Beam Search Mt(stst1,fθ)M_t(s_t | s_{t-1}, f_\theta)5 Per-beam KV caches
SMC Steering Mt(stst1,fθ)M_t(s_t | s_{t-1}, f_\theta)6 (unique calls Mt(stst1,fθ)M_t(s_t | s_{t-1}, f_\theta)7) Shared trie + per-particle weights

With moderate settings (Mt(stst1,fθ)M_t(s_t | s_{t-1}, f_\theta)8, Mt(stst1,fθ)M_t(s_t | s_{t-1}, f_\theta)9–Gt(st1,st,fθ)0G_t(s_{t-1}, s_t, f_\theta) \geq 00), SMC steering achieves within approximately Gt(st1,st,fθ)0G_t(s_{t-1}, s_t, f_\theta) \geq 01 the latency of beam search, but with substantially stronger satisfaction of global constraints.

Empirically, in tasks such as infilling, SMC steering outperforms both local masking and beam search in the diversity and coverage of valid completions. For syntactic constraints (e.g., valid Python code), it achieves Gt(st1,st,fθ)0G_t(s_{t-1}, s_t, f_\theta) \geq 02 validity with diverse outputs. In prompt intersection tasks, SMC steering optimizes the expected log-partition function with increasing Gt(st1,st,fθ)0G_t(s_{t-1}, s_t, f_\theta) \geq 03, mitigating the mode collapse observed in beam search and the dead ends of local masking (Lew et al., 2023).

6. Hyperparameterization and Task Design

SMC steering exposes several tunable hyperparameters:

  • Particle count (Gt(st1,st,fθ)0G_t(s_{t-1}, s_t, f_\theta) \geq 04): Gt(st1,st,fθ)0G_t(s_{t-1}, s_t, f_\theta) \geq 05–Gt(st1,st,fθ)0G_t(s_{t-1}, s_t, f_\theta) \geq 06 balances quality and cost; lower values may miss feasible paths under strong constraints, while higher values linearly increase computational expense.
  • Expansion factor (Gt(st1,st,fθ)0G_t(s_{t-1}, s_t, f_\theta) \geq 07): Typically Gt(st1,st,fθ)0G_t(s_{t-1}, s_t, f_\theta) \geq 08; higher Gt(st1,st,fθ)0G_t(s_{t-1}, s_t, f_\theta) \geq 09 enhances proposal diversity, at the expense of computational overhead.
  • Resampling criterion: ESS threshold (with s1:s_{1:\infty}0) governs tradeoff between sample diversity and degenerate particles.
  • Proposal design: Pushing as much constraint information as possible into s1:s_{1:\infty}1 reduces the variance of importance weights, yielding higher-quality samples for fixed s1:s_{1:\infty}2.

LLaMPPL automatically tracks the proposal (s1:s_{1:\infty}3) and potential (s1:s_{1:\infty}4) via program semantics, allowing focus on declarative task logic.

7. LLaMPPL Library and Practical Usage

LLaMPPL is a Python-based probabilistic programming library designed for specifying Feynman–Kac Transformer models and steering LLaMA-family decoders. Installation is via

s1:s_{1:\infty}5

A minimal workflow involves subclassing Model, defining the step() logic for the intended constraint or objective, and using SMCSteerer to execute steering. Example code structure:

s1:s_{1:\infty}6

LLaMPPL includes example programs for canonical tasks (infilling, structural constraint satisfaction, prompt intersection) and integrates a shared, efficient SMC engine, decoupling user-facing probabilistic program logic from backend inference strategies. The architecture facilitates adaptation to new LLMs or hybrid inference methods, such as combining SMC and MCMC within each decoding step (Lew et al., 2023).

Definition Search Book Streamline Icon: https://streamlinehq.com
References (1)

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to SMC Transformer Steering.