LLM-42: Deterministic LLM Inference
- LLM-42 is a scheduler-driven framework that enforces bit-level determinism on GPU-based LLM inference without modifying existing high-performance kernels.
- It employs a decode–verify–rollback protocol that speculatively decodes tokens under dynamic batching and selectively verifies outputs with a fixed batch shape.
- Experimental results demonstrate that LLM-42 maintains high throughput and low latency while efficiently handling deterministic workloads with minimal rollback overhead.
LLM-42 is a scheduler-driven framework designed to achieve bit-level deterministic inference for LLMs on modern GPU hardware, without requiring kernel rewrites or incurring prohibitive performance penalties. It enforces reproducibility by integrating speculative token decoding with a lightweight “verify–rollback” mechanism, maintaining peak throughput for non-deterministic requests and imposing overhead only in proportion to deterministic workload demand (Gond et al., 25 Jan 2026).
1. Motivation for Determinism in LLM Inference
Modern GPU-accelerated LLM inference is fundamentally non-deterministic at the bit level, even when launched with identical models, prompts, and sampling hyper-parameters. This non-determinism is primarily attributed to three system-level factors:
- Floating-point non-associativity: IEEE 754 arithmetic does not guarantee associativity of addition and multiplication: in general, $(a + b) + c \neq a + (b + c)$ in floating point. As a result, different summation or reduction orders yield minute numerical discrepancies.
- Dynamic batching: GPU inference engines maximize utilization via dynamic batching, wherein queries are coalesced on-the-fly, causing batch sizes—and thus kernel execution paths—to vary between runs.
- Variable reduction order in GPU kernels: High-performance GPU kernels (e.g., BLAS GEMM, attention, normalization) select reduction trees based on input shapes, varying “split-K,” tiling, or specialization strategies. These pipeline-dependent reduction orders further amplify divergence between otherwise identical invocations.
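The first of these factors is easy to see directly; a minimal Python demonstration of floating-point non-associativity:

```python
# IEEE 754 addition is not associative: regrouping changes the result bits.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)   # False
print(left, right)     # 0.6000000000000001 0.6
```

The discrepancy is only in the low-order bits, but once it crosses a token-sampling threshold, two runs of the same prompt diverge visibly.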
Disabling dynamic batching or enforcing a single “batch-invariant” reduction order can restore determinism, but at the cost of sharply reduced throughput (30–60% lower on GEMMs, 50% lower on RMSNorm operators) and significant engineering overhead to maintain distinct kernel stacks. These approaches also inflict fixed runtime penalties regardless of the actual frequency of deterministic requests, motivating the need for a more targeted, flexible solution.
2. System Overview and Key Observations
LLM-42 is predicated on two empirical insights:
- Sequence consistency: If two executions produce bit-identical outputs up to step $i$ (i.e., bit-exact token sequences and KV-cache states), the token at step $i+1$ will, with high probability, also match, because floating-point drifts rarely cross token-sampling thresholds.
- Shape-consistent reductions: Most GPU kernels, once input shapes (batch size, sequence length) are fixed, apply the chosen reduction order uniformly across the batch. Thus, within a verification window of fixed shape, kernel execution is deterministic.
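The second insight can be illustrated with a toy reduction in Python; the chunked schedule below is a simplified stand-in for a kernel's split-K strategy, not LLM-42's actual kernels:

```python
import random

# Simulate a kernel's reduction schedule: split a long sum into chunks,
# reduce each chunk, then combine the partial sums (akin to split-K in GEMM).
def reduce_with_split(xs, chunk):
    partials = [sum(xs[i:i + chunk]) for i in range(0, len(xs), chunk)]
    return sum(partials)

random.seed(0)
xs = [random.uniform(-1, 1) for _ in range(1000)]

# Fixed shape -> fixed schedule -> bit-identical results across runs.
print(reduce_with_split(xs, 64) == reduce_with_split(xs, 64))   # True

# A different schedule can flip the low-order bits of the result.
print(reduce_with_split(xs, 64) == reduce_with_split(xs, 128))  # typically False
```

Holding the input shape fixed pins the schedule, which is exactly what the verification batch exploits.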
System structure: LLM-42 speculatively decodes tokens on a fast path under dynamic batching. After every window of $k$ tokens, it invokes a verifier that replays those tokens under a fixed batch shape (the “verification batch”). If all speculative tokens verify bit-exact, they are committed; on a mismatch, the system rolls back to the divergence point and resumes decoding from there.
The architecture thus lets determinism be enforced selectively, amortizes the cost of verification over long consistent regions, and preserves full throughput where determinism is not required.
3. Decode–Verify–Rollback Protocol
The core of LLM-42 is the Decode–Verify–Rollback (DVR) protocol, parameterized by a window size $k$ and maintaining explicit tracking of the KV-cache and token sequence.
Formal Definitions
Let:
- $p$: user prompt.
- $t_0$: start-of-sequence token.
- $C_i$: KV-cache state after $i$ committed tokens.
- $\sigma(B, L)$: reduction schedule chosen by the GPU kernel for batch size $B$ and sequence length $L$.
- $D$: fast-path decoding function under dynamic batching, returning a candidate token $\hat{t}_{i+1}$ and new state $\hat{C}_{i+1}$.
- $V_k$: verifier that recomputes the next $k$ tokens with a fixed batch shape.
DVR Protocol
- Initialization: $C_0$ from deterministic prefill on $p$; $i \leftarrow 0$.
- Speculate: Decode $k$ tokens under dynamic batching: $(\hat{t}_{i+j}, \hat{C}_{i+j}) = D(\hat{t}_{i+j-1}, \hat{C}_{i+j-1})$ for $j = 1, \dots, k$.
- Verify: Replay the window under fixed shape: $(t_{i+1}, \dots, t_{i+k}) = V_k(\hat{t}_{i+1}, \dots, \hat{t}_{i+k}; C_i)$, where the verifier also produces cache states $C_{i+1}, \dots, C_{i+k}$.
- Mismatch detection: Find the first $m$ with $\hat{t}_{i+m} \neq t_{i+m}$ (if none, $m = k$).
- Commit or rollback: Emit $t_{i+1}, \dots, t_{i+m}$; update the fast-path cache with $C_{i+m}$; set $i \leftarrow i + m$. Repeat until EOS.
Correctness: The verifier is fully deterministic by construction (fixed batch shape), and only tokens matching its output are committed. All others are rejected and recomputed, ensuring reproducibility.
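The protocol can be sketched end-to-end with a toy deterministic model; `next_token`, `fast_decode`, and `verify_window` below are hypothetical stand-ins, not the paper's implementation:

```python
import random

# Toy sketch of the Decode-Verify-Rollback (DVR) loop.
K = 4  # verification window size

def next_token(state):
    # Canonical, deterministic next-token rule (the fixed-shape reference).
    return (sum(state) * 31 + 7) % 100

def fast_decode(prefix, k, rng):
    # Fast path: same rule, but with a small chance of a one-token divergence,
    # modeling a floating-point drift that crosses a sampling threshold.
    out, state = [], list(prefix)
    for _ in range(k):
        t = next_token(state)
        if rng.random() < 0.05:
            t = (t + 1) % 100
        out.append(t)
        state.append(t)
    return out

def verify_window(prefix, spec):
    # Verifier: teacher-forced replay of the speculative tokens at fixed shape.
    out, state = [], list(prefix)
    for t_hat in spec:
        out.append(next_token(state))
        state.append(t_hat)
    return out

def dvr_generate(prompt, n_tokens, seed=0):
    rng = random.Random(seed)
    committed, rollbacks = list(prompt), 0
    while len(committed) - len(prompt) < n_tokens:
        spec = fast_decode(committed, K, rng)                     # speculate
        ref = verify_window(committed, spec)                      # verify
        m = next((j for j in range(K) if spec[j] != ref[j]), K)   # 1st mismatch
        if m < K:
            rollbacks += 1                                        # rollback
        committed.extend(ref[:min(m + 1, K)])  # commit verified (+corrected)
    return committed[len(prompt):len(prompt) + n_tokens], rollbacks

# Two runs whose fast paths drift differently still agree bit-exactly.
out1, _ = dvr_generate([1, 2, 3], 32, seed=0)
out2, _ = dvr_generate([1, 2, 3], 32, seed=1)
print(out1 == out2)  # True
```

Because every committed token comes from the deterministic verifier, the output is independent of how (or whether) the fast path drifts; the drift only costs rollbacks.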
4. GPU Kernel Handling and Overhead Analysis
LLM-42 preserves all high-performance GPU kernels (e.g., cuBLAS GEMM, FlashAttention, fused RMSNorm) unchanged. The fast path is fully dynamic and batch-optimized, while the verifier simply reruns the same computation at a fixed shape by padding as needed. This eliminates the need for dual kernel stacks or global slowdowns.
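A minimal sketch of the padding step, assuming simple right-padding of a variable-size batch (the shapes and pad value are illustrative choices, not from the paper):

```python
# Right-pad a variable-size batch to a fixed (batch, length) shape so the
# verifier's kernels always see identical shapes, and hence identical
# reduction schedules.
FIXED_BATCH, FIXED_LEN, PAD = 8, 16, 0

def pad_to_fixed(batch):
    rows = [seq + [PAD] * (FIXED_LEN - len(seq)) for seq in batch]
    rows += [[PAD] * FIXED_LEN for _ in range(FIXED_BATCH - len(rows))]
    return rows

padded = pad_to_fixed([[5, 9, 3], [7, 1], [2, 2, 2, 2]])
print(len(padded), len(padded[0]))  # 8 16
```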
Let:
- $T_f$: time for fast-path decoding (per token).
- $T_v$: time per verification window.
- $d$: fraction of tokens requiring verification.
- $p_m$: probability that a verification window yields a mismatch.
Total expected inference time per token:

$$T = T_f + d\left(\frac{T_v}{k} + p_m\,T_r\right),$$

with $k$ the window size and $T_r$ the cost of recomputing rolled-back tokens. Empirically, $p_m \ll 1$, so recompute overhead is negligible in practice. As $d \to 0$, the system approaches the throughput of the fully non-deterministic fast path.
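A quick numeric plug-in, assuming the per-token cost decomposes as T = T_f + d·(T_v/k + p_m·T_r); all constants below are illustrative, not measurements from the paper:

```python
# Illustrative evaluation of the assumed per-token cost model.
T_f = 1.00   # fast-path time per token (arbitrary units)
T_v = 4.00   # time per verification window
T_r = 2.00   # cost of recomputing rolled-back tokens
k, p_m = 8, 0.01

for d in (0.0, 0.1, 0.5, 1.0):  # fraction of tokens requiring verification
    T = T_f + d * (T_v / k + p_m * T_r)
    print(f"d={d:.1f}: T={T:.3f}")  # d=0.0 recovers the fast path exactly
```

The overhead term is linear in $d$, which is the mechanism behind the "pay only for deterministic traffic" behavior observed below.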
5. Experimental Results
Experiments are reported on Llama-3.1-8B-Instruct evaluated on H100 PCIe GPUs, using the SGLang serving engine. Three configurations are compared:
- SGLang-Non-Deterministic: optimized kernels, no determinism.
- SGLang-Deterministic: all kernels replaced with batch-invariant versions (60% slower GEMM, 50% slower RMSNorm).
- LLM-42: non-deterministic SGLang with DVR scheduler and grouped verification.
Offline throughput (tokens/sec, batch size 8, mean output 192 tokens):
| Deterministic fraction | SGLang-Non-Deterministic | SGLang-Deterministic | LLM-42 |
|---|---|---|---|
| 0% | 931 | – | 932 |
| 10% | – | – | 914 |
| 20% | – | – | 903 |
| 50% | – | – | 869 |
| 100% | – | 415 | 421 |
LLM-42 achieves up to a 2.2× throughput improvement over SGLang-Deterministic when only a small fraction (<10%) of traffic demands determinism, with minimal throughput reduction (2% at 10% deterministic).
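The headline ratios follow directly from the table:

```python
# Sanity-check of the headline numbers against the offline-throughput table.
speedup = 914 / 415        # LLM-42 at 10% deterministic vs SGLang-Deterministic
reduction = 1 - 914 / 931  # loss vs the non-deterministic baseline at 10%
print(round(speedup, 1), f"{reduction:.1%}")  # 2.2 1.8%
```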
Online end-to-end latency (P50 / P90 at 12 QPS):
| Mode | P50 (s) | P90 (s) |
|---|---|---|
| SGLang-Non-Deterministic | 2.15 | 5.03 |
| SGLang-Deterministic | 4.64 | 13.2 |
| LLM-42 @ 2% deterministic | 2.21 | 5.25 |
| LLM-42 @ 10% deterministic | 2.79 | 5.40 |
| LLM-42 @ 50% deterministic | 3.85 | 6.81 |
| LLM-42 @ 100% deterministic | 4.54 | 7.39 |
Rollback frequency remains low: with 100% deterministic traffic and 4,096 requests, only 96 rollbacks are observed (∼0.02 per request), and just 0.32% of tokens require recomputation. These results confirm that LLM-42's cost scales with the deterministic share of the workload.
6. Comparison with Prior Approaches
Conventional methods for enforcing determinism rely on disabling dynamic batching or deploying only batch-invariant kernels. These induce global slowdowns (30–60% for GEMMs, 50% for RMSNorm), require maintaining two kernel codebases, and penalize every query, including those not requiring determinism.
In contrast, LLM-42:
- Operates entirely atop the existing high-performance kernel stack.
- Preserves dynamic batching and hardware–software co-optimization on the fast path.
- Restricts overhead to only those segments of traffic requiring deterministic outputs.
- Requires no new kernel development, only a scheduler, a fixed-shape verifier, and cache patching for rollbacks.
This division of labor enables production LLM serving engines to balance peak throughput, low latency, and reproducibility without systemic compromises.
LLM-42 thus provides a principled and efficient methodology for bit-exact reproducibility in GPU-based LLM inference by speculative decoding and selective verification, aligned with modern system-level optimization strategies (Gond et al., 25 Jan 2026).