Byte-level Sequential Monte Carlo
- Byte-level Sequential Monte Carlo is a decoding-time inference method that ensembles language models over complete strings rather than relying on local next-token decisions.
- It employs a unified f-ensemble framework utilizing generalized means to combine disparate model scores into a true global ensemble distribution.
- The approach overcomes tokenizer mismatches by mapping outputs to a shared byte alphabet, enabling consistent inference across different vocabulary systems.
Byte-level Sequential Monte Carlo (SMC) is a decoding-time inference method for sampling from a language-model ensemble defined over complete strings rather than local next-token decisions. In "Ensembling LLMs with Sequential Monte Carlo" (Chan et al., 5 Mar 2026), the method is introduced as part of a unified framework for composing LLMs into $f$-ensemble distributions for a wide range of aggregation functions $f$. Its distinguishing feature is that it operates in a shared character space, specifically a shared byte alphabet, which enables ensembles of models with mismatching vocabularies and consistent sampling in the limit (Chan et al., 5 Mar 2026). The method addresses a central problem in language-model ensembling: naïvely averaging next-token probabilities yields samples from a locally normalized, biased approximation of the generally intractable ensemble distribution over strings.
1. Global string ensembles and the failure of local normalization
The motivating claim is that naïvely averaging next-token probabilities is not the same as sampling from the true ensemble distribution over complete strings (Chan et al., 5 Mar 2026). If each model defines a distribution over strings, then a common decoding-time heuristic is to combine next-token probabilities at each step, for example by averaging or multiplying them locally. But this yields a distribution over prefix decisions that is only a locally normalized approximation to the desired ensemble over full strings.
The key issue is that if the ensemble over complete strings is defined first and the corresponding next-token conditionals are derived afterward, the correct next-token probability depends on the future mass of all completions. A token that looks good locally may lead to poor completions globally, and vice versa. Thus, decoding by local token averaging is a biased approximation to the intended global distribution over strings (Chan et al., 5 Mar 2026).
The paper illustrates this distinction with a prompt-intersection example built from “My favorite physicist is” and “My favorite author is.” Locally normalized token-by-token product ensembling tends to favor generic early completions that are individually likely under both prompts, even when the resulting full string is not globally likely under the intended intersection target. In the paper’s formulation, local scoring can over-reward prefixes like “a …” that are easy early on, rather than strings that are jointly good end-to-end.
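The gap between local and global product ensembling can be seen on a very small example. The sketch below is not the paper's prompt-intersection experiment; it uses two hypothetical toy models (`MODEL_1`, `MODEL_2`, with made-up probabilities) over two-symbol strings and compares the globally normalized product over full strings with the step-by-step locally normalized product.

```python
from itertools import product
from math import prod

ALPHABET = "ab"

# Hypothetical toy models over two-symbol strings: a first-symbol distribution
# and a second-symbol distribution conditioned on the first symbol.
MODEL_1 = {"first": {"a": 0.5, "b": 0.5},
           "next": {"a": {"a": 0.9, "b": 0.1}, "b": {"a": 0.5, "b": 0.5}}}
MODEL_2 = {"first": {"a": 0.5, "b": 0.5},
           "next": {"a": {"a": 0.1, "b": 0.9}, "b": {"a": 0.5, "b": 0.5}}}
MODELS = [MODEL_1, MODEL_2]


def normalize(scores):
    """Rescale non-negative scores so they sum to one."""
    total = sum(scores.values())
    return {key: value / total for key, value in scores.items()}


def string_prob(model, s):
    """Probability a model assigns to the full two-symbol string s."""
    return model["first"][s[0]] * model["next"][s[0]][s[1]]


def global_product(models):
    """Multiply full-string probabilities, then normalize once over all strings."""
    strings = ["".join(chars) for chars in product(ALPHABET, repeat=2)]
    return normalize({s: prod(string_prob(m, s) for m in models) for s in strings})


def local_product(models):
    """Multiply next-symbol probabilities and renormalize at every step."""
    first = normalize({c: prod(m["first"][c] for m in models) for c in ALPHABET})
    result = {}
    for c1 in ALPHABET:
        second = normalize(
            {c2: prod(m["next"][c1][c2] for m in models) for c2 in ALPHABET}
        )
        for c2 in ALPHABET:
            result[c1 + c2] = first[c1] * second[c2]
    return result


GLOBAL = global_product(MODELS)
LOCAL = local_product(MODELS)
print("global mass on strings starting with 'b':", GLOBAL["ba"] + GLOBAL["bb"])
print("local mass on strings starting with 'b': ", LOCAL["ba"] + LOCAL["bb"])
```

In this toy setting the two models agree on the first symbol but disagree sharply after "a". The global product therefore places about 0.74 of its mass on strings starting with "b", while the locally normalized product places only 0.5 there, because the first-step decision cannot see the disagreement that follows.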
A common misconception is that local agreement at each step is equivalent to global agreement over complete outputs. The paper directly rejects this equivalence. Byte-level SMC is designed precisely for the setting in which the intended target is the ensemble distribution over full strings rather than a heuristic over successive local decisions.
2. The unified $f$-ensemble framework
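The method is defined within a general ensemble family parameterized by an aggregation function $f$ that maps the scores of $K$ models to a single non-negative score. Given language-model potentials $p_1, \dots, p_K$ over strings, the unnormalized ensemble score of a string $x$ is
$$\tilde{p}_f(x) = f\big(p_1(x), \dots, p_K(x)\big).$$
The normalized $f$-ensemble distribution is
$$p_f(x) = \frac{\tilde{p}_f(x)}{Z_f}, \qquad Z_f = \sum_{x} \tilde{p}_f(x),$$
assuming $0 < Z_f < \infty$ (Chan et al., 5 Mar 2026).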
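The role of $f$ is explicit. Consensus-seeking aggregators emphasize overlap among models, whereas coverage-seeking aggregators spread mass across regions supported by any model. The paper particularly studies the generalized mean family, which for $K$ models with weights $w_i$ and exponent $\tau$ reads
$$f_\tau(p_1, \dots, p_K) = \Big(\sum_{i=1}^{K} w_i\, p_i^{\tau}\Big)^{1/\tau},$$
with the special or limit cases $\tau \to -\infty$ for minimum, $\tau \to 0$ for product of experts, $\tau = 1$ for mixture or sum, $\tau \to \infty$ for maximum, $\tau = -1$ for the harmonic mean, and $\tau = 2$ for the quadratic mean (Chan et al., 5 Mar 2026).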
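The special cases of the generalized mean can be checked with a few lines of code. The following is a minimal sketch of a weighted power mean; the function name `generalized_mean` and the exponent name `tau` are assumptions made for illustration, and the weights are assumed to be positive and to sum to one.

```python
import math


def generalized_mean(values, weights, tau):
    """Weighted power mean of non-negative values with exponent tau.

    Assumes positive weights that sum to one. Limit cases are handled
    explicitly: tau = -inf (minimum), tau = 0 (geometric mean, i.e. a
    weighted product), and tau = +inf (maximum).
    """
    if tau == -math.inf:
        return min(values)
    if tau == math.inf:
        return max(values)
    if tau <= 0 and any(v == 0 for v in values):
        # A zero score from any model forces the aggregate to zero.
        return 0.0
    if tau == 0:
        return math.exp(sum(w * math.log(v) for v, w in zip(values, weights)))
    return sum(w * v ** tau for v, w in zip(values, weights)) ** (1.0 / tau)


scores = [0.2, 0.8]   # hypothetical scores two models give the same string
weights = [0.5, 0.5]
for tau in (-math.inf, -1, 0, 1, 2, math.inf):
    print(tau, round(generalized_mean(scores, weights, tau), 4))
```

For these example scores the aggregate grows with the exponent, from the minimum (0.2) through the harmonic (0.32), geometric (0.4), and arithmetic (0.5) means up to the maximum (0.8), which is one way to see the shift from consensus-seeking to coverage-seeking behavior.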
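The paper also states a variational characterization: the generalized mean is the unique minimizer over candidates $q$ of a weighted sum of $\alpha$-divergences, $\sum_i w_i\, D_\alpha(p_i, q)$, between each component model $p_i$ and $q$, with the divergence order $\alpha$ determined by the exponent of the mean (Chan et al., 5 Mar 2026).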
This framework places byte-level SMC within a broader family of global ensemble objectives. A plausible implication is that the algorithm is not tied to a single notion of model combination; rather, it is an inference procedure for a class of string-level targets induced by $f$.
3. Shared byte space and tokenizer mismatch
A major practical problem in language-model ensembling is tokenizer mismatch: different models may use incompatible vocabularies, so token-level probabilities cannot be directly aligned (Chan et al., 5 Mar 2026). The paper’s solution is to move to a shared character or byte space, where every model can be interpreted as a distribution over byte strings. This sidesteps vocabulary alignment entirely.
If a model is originally defined over token sequences, it is induced into a distribution over byte strings $x$ by summing over all tokenizations $t$ that decode to $x$:
$$p(x) = \sum_{t \,:\, \mathrm{decode}(t) = x} p_{\mathrm{tok}}(t).$$
This is crucial for three reasons: many token sequences can map to the same byte string, different models can then be compared in the same output space, and the ensemble target is well-defined over strings rather than tokenizations (Chan et al., 5 Mar 2026).
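The marginalization over tokenizations can be illustrated by brute force on a tiny vocabulary. The sketch below is not how the paper computes byte probabilities; it assumes a hypothetical context-free token model (`TOKEN_PROBS`, `EOS_PROB`, with made-up values) so that every tokenization of a short byte string can be enumerated and summed.

```python
# Hypothetical token-level model: each token is drawn independently of
# context, and generation stops with probability EOS_PROB at every step.
TOKEN_PROBS = {b"a": 0.3, b"b": 0.3, b"ab": 0.2}
EOS_PROB = 0.2


def tokenizations(text):
    """Yield every token sequence whose concatenation equals text."""
    if text == b"":
        yield []
        return
    for token in TOKEN_PROBS:
        if text.startswith(token):
            for rest in tokenizations(text[len(token):]):
                yield [token] + rest


def token_sequence_prob(tokens):
    """Probability of emitting exactly these tokens and then stopping."""
    prob = EOS_PROB
    for token in tokens:
        prob *= TOKEN_PROBS[token]
    return prob


def byte_string_prob(text):
    """Byte-string probability: sum over all tokenizations that decode to text."""
    return sum(token_sequence_prob(tokens) for tokens in tokenizations(text))


print(list(tokenizations(b"ab")))   # [[b'a', b'b'], [b'ab']]
print(byte_string_prob(b"ab"))      # 0.3*0.3*0.2 + 0.2*0.2
```

Here the byte string `b"ab"` receives mass from both the two-token sequence and the single merged token. Exhaustive enumeration grows quickly with string length, so it serves only to make the definition concrete.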
In practice, the paper mentions a character-to-byte mapping, so generation happens in a shared byte alphabet. The significance of this representation is operational rather than merely notational: every model can be mapped to the same byte alphabet, ensemble scoring is done on shared strings, no union-vocabulary heuristic is needed, and exact string identity, not token identity, is the unit of comparison (Chan et al., 5 Mar 2026).
The same section of the paper also notes the cost of this choice. Byte-level generation typically requires more steps, and computing byte probabilities from tokenized models requires marginalizing over tokenizations, so it is more expensive than token-level SMC.
4. Sequential importance sampling and the byte-level SMC procedure
The algorithmic objective is to sample strings from
$$p_f(x) \propto f\big(p_1(x), \dots, p_K(x)\big)$$
without needing to enumerate all strings (Chan et al., 5 Mar 2026). Because LLMs are autoregressive, any string probability can be factorized into prefix conditionals. The paper defines prefix probabilities and conditional prefix probabilities for strings $x$ and prefixes $x_{<t}$, and SMC operates on partial strings while using a proposal to extend them one symbol at a time.
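The importance-sampling view begins with complete strings. Proposal samples are drawn as $x^{(n)} \sim q$ for $n = 1, \dots, N$, with importance weights
$$w^{(n)} = \frac{\tilde{p}_f(x^{(n)})}{q(x^{(n)})},$$
where $\tilde{p}_f$ is the unnormalized ensemble score. Then
$$\hat{Z} = \frac{1}{N} \sum_{n=1}^{N} w^{(n)}$$
is an unbiased estimator of the normalizing constant $Z_f = \sum_x \tilde{p}_f(x)$, and the self-normalized estimate of the ensemble distribution is consistent as $N \to \infty$ (Chan et al., 5 Mar 2026).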
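The complete-string importance-sampling estimator can be tried on a target small enough to enumerate. The sketch below assumes a hypothetical four-string space with made-up model probabilities, a product aggregation, and a uniform proposal; none of these choices come from the paper.

```python
import random

STRINGS = ["aa", "ab", "ba", "bb"]

# Hypothetical full-string probabilities under two toy models.
P1 = {"aa": 0.45, "ab": 0.05, "ba": 0.25, "bb": 0.25}
P2 = {"aa": 0.05, "ab": 0.45, "ba": 0.25, "bb": 0.25}


def unnormalized_score(x):
    """Product-of-experts score of a complete string."""
    return P1[x] * P2[x]


def proposal_prob(x):
    """Uniform proposal over the four strings (an illustrative choice)."""
    return 1.0 / len(STRINGS)


def importance_sample(num_samples, seed=0):
    rng = random.Random(seed)
    samples = rng.choices(STRINGS, k=num_samples)
    weights = [unnormalized_score(x) / proposal_prob(x) for x in samples]
    # The average weight estimates the normalizing constant without bias.
    z_hat = sum(weights) / num_samples
    # Self-normalized weights estimate the ensemble distribution.
    total = sum(weights)
    posterior = {x: 0.0 for x in STRINGS}
    for x, w in zip(samples, weights):
        posterior[x] += w / total
    return z_hat, posterior


Z_TRUE = sum(unnormalized_score(x) for x in STRINGS)
Z_HAT, POSTERIOR = importance_sample(20000)
print("true Z:", Z_TRUE, "estimate:", Z_HAT)
print("estimated p('ba'):", POSTERIOR["ba"], "exact:", unnormalized_score("ba") / Z_TRUE)
```

With enough samples the average weight settles near the exact normalizer (0.17 for these numbers), and the self-normalized weights recover the exact ensemble probabilities. For real language models the string space cannot be enumerated, which is what motivates the sequential version.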
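Because strings are generated incrementally, the paper uses sequential importance sampling (SIS). To do this efficiently, it introduces a shaping function $\psi$ that is tractable at prefixes and guides the proposal. A standard choice applies the aggregation function $f$ to the models' prefix probabilities,
$$\psi(x_{\le t}) = f\big(\overrightarrow{p}_1(x_{\le t}), \dots, \overrightarrow{p}_K(x_{\le t})\big),$$
where $\overrightarrow{p}_i(x_{\le t})$ is the probability that model $i$ generates a string beginning with $x_{\le t}$, but this is not itself the target distribution; it is a surrogate used to define proposal dynamics. The prefix-shaped conditional is defined as
$$\psi(b \mid x_{<t}) = \frac{\psi(x_{<t}\, b)}{\psi(x_{<t})}.$$
The locally optimal proposal minimizing per-step weight variance is proportional to the shaping conditional: $q^{*}(b \mid x_{<t}) \propto \psi(b \mid x_{<t})$ (Chan et al., 5 Mar 2026).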
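The concrete SMC procedure maintains $N$ particles, each a partial byte string with a weight. Initialization is
$$x^{(n)} = \varepsilon, \qquad w^{(n)} = 1, \qquad n = 1, \dots, N,$$
where $\varepsilon$ is the empty string. While some particles are unfinished, the algorithm samples the next byte
$$b^{(n)} \sim q\big(\cdot \mid x^{(n)}_{<t}\big),$$
marks the particle complete if $b^{(n)} = \mathrm{EOS}$, otherwise appends it,
$$x^{(n)}_{\le t} = x^{(n)}_{<t}\, b^{(n)},$$
and updates the weight by
$$w^{(n)} \leftarrow w^{(n)} \cdot \frac{\psi\big(b^{(n)} \mid x^{(n)}_{<t}\big)}{q\big(b^{(n)} \mid x^{(n)}_{<t}\big)},$$
where $\psi(\cdot \mid \cdot)$ is the prefix-shaped conditional. After each step, the effective sample size is computed as
$$\mathrm{ESS} = \frac{\big(\sum_{n=1}^{N} w^{(n)}\big)^2}{\sum_{n=1}^{N} \big(w^{(n)}\big)^2}.$$
If $\mathrm{ESS} < \kappa N$ for threshold $\kappa$, particles are resampled multinomially: $a^{(n)} \sim \mathrm{Categorical}\big(w^{(1)} / \sum_m w^{(m)}, \dots, w^{(N)} / \sum_m w^{(m)}\big)$. Each particle is then replaced by its ancestor and its weight is reset to the average weight $\frac{1}{N} \sum_m w^{(m)}$. At the end,
$$\hat{Z} = \frac{1}{N} \sum_{n=1}^{N} w^{(n)}, \qquad \hat{p}_f(x) = \sum_{n=1}^{N} \frac{w^{(n)}}{\sum_m w^{(m)}}\, \mathbb{1}\{x^{(n)} = x\}.$$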
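The procedure above can be sketched end to end on a toy problem. The code below is an illustrative reading of the steps, not the paper's implementation: the two context-free models (`model_1`, `model_2`), the product aggregation, the use of the locally optimal proposal, the reset of resampled weights to the average weight, and the `max_steps` guard are all assumptions made so that the example stays small and can be checked against a closed-form normalizer.

```python
import random

EOS = None  # end-of-string marker (an assumed representation)


# Hypothetical models: map a prefix to next-symbol probabilities.
def model_1(prefix):
    return {"a": 0.6, "b": 0.1, EOS: 0.3}


def model_2(prefix):
    return {"a": 0.1, "b": 0.6, EOS: 0.3}


MODELS = [model_1, model_2]


def product_aggregate(probs):
    out = 1.0
    for p in probs:
        out *= p
    return out


def smc(models, aggregate, num_particles=2000, ess_fraction=0.5, seed=0, max_steps=200):
    rng = random.Random(seed)
    # Each particle tracks its text, per-model prefix probabilities, and weight.
    particles = [{"text": "", "probs": [1.0] * len(models), "weight": 1.0, "done": False}
                 for _ in range(num_particles)]
    for _ in range(max_steps):
        if all(p["done"] for p in particles):
            break
        for p in particles:
            if p["done"]:
                continue
            nexts = [m(p["text"]) for m in models]
            symbols = list(nexts[0])
            # Per-model prefix probabilities after each possible extension.
            ext = {s: [pp * nx[s] for pp, nx in zip(p["probs"], nexts)] for s in symbols}
            psi_prev = aggregate(p["probs"])
            shaped = [aggregate(ext[s]) / psi_prev for s in symbols]
            # Locally optimal proposal: sample in proportion to the shaped
            # conditional; the incremental weight is then its normalizer.
            s = rng.choices(symbols, weights=shaped)[0]
            p["weight"] *= sum(shaped)
            p["probs"] = ext[s]
            if s is EOS:
                p["done"] = True
            else:
                p["text"] += s
        weights = [p["weight"] for p in particles]
        ess = sum(weights) ** 2 / sum(w * w for w in weights)
        if ess < ess_fraction * num_particles:
            # Multinomial resampling; weights are reset to the average weight.
            mean_weight = sum(weights) / num_particles
            ancestors = rng.choices(particles, weights=weights, k=num_particles)
            particles = [dict(a, weight=mean_weight) for a in ancestors]
    weights = [p["weight"] for p in particles]
    total = sum(weights)
    posterior = {}
    for p in particles:
        posterior[p["text"]] = posterior.get(p["text"], 0.0) + p["weight"] / total
    return total / num_particles, posterior


Z_HAT, POSTERIOR = smc(MODELS, product_aggregate)
print("estimated Z:", Z_HAT, "exact Z:", 0.09 / 0.88)
print("estimated p(empty string):", POSTERIOR[""], "exact:", 0.88)
```

For these toy models every symbol contributes a factor of 0.06 to the product score and stopping contributes 0.09, so the exact normalizer is 0.09 / (1 - 0.12) and the exact probability of the empty string is 0.88. That makes it easy to check that the particle estimates land close to the truth. Real byte-level models would replace `model_1` and `model_2` with conditionals obtained by marginalizing over tokenizations.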
5. Correctness guarantees and the role of annihilative aggregation
Under standard importance-sampling assumptions, the method has the usual guarantees: the average weight $\hat{Z}$ is unbiased for the ensemble's normalizing constant $Z_f$, the unnormalized target estimator is unbiased, and the self-normalized estimator is consistent as the number of particles $N \to \infty$ (Chan et al., 5 Mar 2026). These are the paper’s formal correctness statements for the SMC estimator.
The paper also emphasizes absolute continuity: the proposal must assign nonzero probability wherever the target does. This condition is linked to the properties of the aggregation function family. An aggregation function family is defined as annihilative if zero probability at any prefix forces zero probability at the full string. Generalized means satisfy this, which makes them compatible with the chosen shaping scheme (Chan et al., 5 Mar 2026).
This condition clarifies an important point about the algorithm’s scope. Byte-level SMC is not a generic decoding heuristic detached from the target distribution; its validity depends on compatibility between the target ensemble and the proposal mechanism. A plausible implication is that the algorithm’s consistency claims are tied not only to the number of particles but also to the structural relationship between the aggregation rule and the support of the proposal.
6. Experimental settings, empirical findings, and practical implications
The paper evaluates byte-level SMC with instruction-tuned models from three families: Llama 3.1-8B-Instruct, Qwen2.5-7B-Instruct, and Phi-4 (14B) (Chan et al., 5 Mar 2026). Two ensemble settings are studied: within-model ensembles, where the same model is used with different prompts, and cross-model ensembles, where different models are used with the same prompt. The tasks are three structured generation problems: JSON Schema, BIG-Bench Hard: Word Sorting, and Spider Text-to-SQL. The experiments use 100 random instances per dataset.
Because an ensemble defines a distribution, evaluation is based on expected accuracy,
$$\mathrm{Acc}(p_f) = \mathbb{E}_{x \sim p_f}\big[\mathbb{1}\{x \in \mathcal{C}\}\big] = \sum_{x \in \mathcal{C}} p_f(x),$$
where $\mathcal{C}$ is the set of correct outputs (Chan et al., 5 Mar 2026). The aggregation functions compared are four generalized-mean extremes: min, product, mixture or sum, and max. The baselines and approximations compared are the best base model, local probability averaging, locally normalized ensemble decoding, token-level SMC, and byte-level SMC. The default configuration fixes the number of particles and the resampling threshold, and uses equal model weights and 5 random seeds.
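Expected accuracy is straightforward to compute from a weighted particle approximation. The sketch below assumes the approximation is available as a mapping from output strings to weights and that correctness is given by a predicate; the function name `expected_accuracy` and the example outputs are hypothetical.

```python
def expected_accuracy(distribution, is_correct):
    """Probability mass that a (possibly unnormalized) distribution puts on correct outputs."""
    total = sum(distribution.values())
    correct = sum(weight for output, weight in distribution.items() if is_correct(output))
    return correct / total


# Hypothetical weighted outputs for a word-sorting instance.
approximation = {"apple banana cherry": 0.5, "banana apple cherry": 0.3, "apple cherry": 0.2}


def sorted_and_complete(output):
    words = output.split()
    return words == sorted(words) and len(words) == 3


print(expected_accuracy(approximation, sorted_and_complete))  # 0.5
```

Unlike accuracy of a single decoded output, this quantity rewards a distribution for concentrating mass on correct strings, which is why it suits the comparison of ensemble targets.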
The main empirical findings are presented in four parts. First, ensembling helps most when the prompts or models are complementary and each succeeds on overlapping subsets of examples. Second, consensus-seeking ensembles, especially min and product, consistently outperform coverage-seeking methods like sum or mixture and max (Chan et al., 5 Mar 2026). The paper states that this matches the prompt-intersection intuition: the useful ensemble mass is often on the intersection of model supports, not the union.
Third, for mixture or sum ensembles, expected accuracy is bounded by the weighted average of base accuracies. Writing $\mathrm{Acc}(p)$ for the expected accuracy under $p$ and assuming weights $w_i$ that sum to one, $\mathrm{Acc}\big(\sum_i w_i p_i\big) = \sum_i w_i\, \mathrm{Acc}(p_i) \le \max_i \mathrm{Acc}(p_i)$, so equal-weight averaging just returns the arithmetic mean of base accuracies (Chan et al., 5 Mar 2026). This directly limits what probability averaging can achieve. Fourth, the paper correlates the quality of the posterior approximation with expected accuracy. For min and product, better posterior approximation tends to correlate positively with accuracy; for sum and max, the correlation is weak or even negative (Chan et al., 5 Mar 2026).
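The limit on mixture ensembles follows from linearity and can be checked numerically. The sketch below uses two hypothetical base distributions over two outputs, with made-up probabilities, and compares the accuracy of their equal-weight mixture with the accuracy of a product ensemble.

```python
def accuracy(distribution, correct):
    """Normalized probability mass on the set of correct outputs."""
    total = sum(distribution.values())
    return sum(p for x, p in distribution.items() if x in correct) / total


# Hypothetical base models over two candidate outputs; only "x" is correct.
P1 = {"x": 0.7, "y": 0.3}
P2 = {"x": 0.6, "y": 0.4}
CORRECT = {"x"}
WEIGHTS = [0.5, 0.5]

mixture = {o: WEIGHTS[0] * P1[o] + WEIGHTS[1] * P2[o] for o in P1}
product = {o: P1[o] * P2[o] for o in P1}

MIXTURE_ACC = accuracy(mixture, CORRECT)
PRODUCT_ACC = accuracy(product, CORRECT)
AVERAGE_ACC = WEIGHTS[0] * accuracy(P1, CORRECT) + WEIGHTS[1] * accuracy(P2, CORRECT)

print("mixture:", MIXTURE_ACC, "weighted average of bases:", AVERAGE_ACC)
print("product:", PRODUCT_ACC)
```

In this example the mixture accuracy coincides with the average of the base accuracies (0.65), while the product ensemble reaches roughly 0.78 because both models agree that the correct output is the more likely one. The product is not guaranteed to win in general; the example only shows that it is not subject to the same ceiling.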
The practical implications follow directly from these results. Byte-level SMC handles mismatched tokenizers by working in byte space, gives a principled way to sample from a true global ensemble distribution over strings, supports more than probability averaging, and can improve structured generation accuracy over single models and local averaging. The limitations are equally explicit: it is much more expensive than local decoding, byte-level generation increases sequence length and SMC cost, the experiments cover only a limited number of ensembled models, and the method is evaluated on structured tasks with exact-match or execution-based metrics, so open-ended generation remains less explored (Chan et al., 5 Mar 2026).
Taken together, these results place byte-level SMC as a method for global ensemble inference rather than a variant of local token fusion. The paper’s bottom-line conclusion is that language-model ensembling should be defined over complete strings, not just local next-token probabilities, and that better posterior approximation improves performance when the ensemble objective is truly intersection-like (Chan et al., 5 Mar 2026).