Behavior of βN under stochastic sampling in the mid-range regime

Determine the behavior, including the probability of correct copying, of a decoder-only Transformer with compact position embeddings on the bitstring βN under stochastic output sampling, in the regime where Ny < 1 and NεN ≈ 1. Here βN is the sequence obtained by flipping a single bit of the baseline sequence αN so that it lies within the continuity threshold of αN, αN is copied with confidence 1 − y, and εN denotes the smallest continuity parameter attainable at size N.

Background

The paper studies the bitstring copy task using a decoder-only Transformer with compact position embeddings (CPE), leveraging continuity results to construct a confounding sequence βN that differs from a baseline αN by a single bit flip. Under greedy decoding, βN is not correctly copied despite potentially very low log-perplexity.

Extending to stochastic sampling, the authors fold sampling temperature into the confidence parameter y and analyze error probabilities using Boole’s inequality and continuity. They bound the probability that the model fails to produce αN in response to βN by N(y + ε), and then identify three regimes based on Ny and NεN.

In the mid-range regime where Ny < 1 and NεN ≈ 1, their theory does not directly apply, leading to an explicit gap: the model copies αN with high probability, but its behavior on βN cannot be concretely characterized by the provided analysis. This unresolved case motivates determining the model’s behavior on βN in that specific regime.
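
As a rough illustration of why the analysis stalls in this regime, the sketch below evaluates the N(y + ε) quantity quoted above, read as a Boole's-inequality (union) bound over N per-token error events. It assumes, as a simplification not taken from the paper, that each token's error probability is at most y + ε. The function names `union_bound` and `is_informative` and the numeric inputs are hypothetical and chosen only for illustration.

```python
def union_bound(n: int, y: float, eps: float) -> float:
    """Union bound n * (y + eps) on the probability of at least one
    per-token error, capped at 1 because probabilities cannot exceed 1.

    Assumption (not from the paper): each of the n tokens errs with
    probability at most y + eps.
    """
    if n < 0 or y < 0 or eps < 0:
        raise ValueError("n, y and eps must be non-negative")
    return min(1.0, n * (y + eps))


def is_informative(n: int, y: float, eps: float) -> bool:
    """A union bound says something only when it is strictly below 1."""
    return n * (y + eps) < 1.0


# Hypothetical numbers: both N*y and N*eps are small, so the bound is useful.
small = union_bound(100, 0.001, 0.0001)

# Hypothetical mid-range numbers: N*y < 1 but N*eps is about 1,
# so the bound reaches 1 and no longer constrains the outcome.
mid_range = union_bound(100, 0.001, 0.01)
```

With these inputs the first bound stays well below 1, while the second saturates at 1. That saturation matches the gap described above: once NεN is close to 1, a bound of this form cannot say whether sampling on βN yields αN, βN, or something else.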

References

Ny < 1, NεN ≈ 1: The model copies αN with high probability, and the sequence is not long enough for our theory to apply. In this case, we are unable to make concrete claims about the model's behaviour on βN.

Perplexity Cannot Always Tell Right from Wrong (2601.22950, Veličković et al., 30 Jan 2026), Section 3.3