LLM-42: Deterministic LLM Inference
- LLM-42 is a scheduler-driven framework that enforces bit-level determinism on GPU-based LLM inference without modifying existing high-performance kernels.
- It employs a decode–verify–rollback protocol that speculatively decodes tokens under dynamic batching and selectively verifies outputs with a fixed batch shape.
- Experimental results demonstrate that LLM-42 maintains high throughput and low latency while efficiently handling deterministic workloads with minimal rollback overhead.
LLM-42 is a scheduler-driven framework designed to achieve bit-level deterministic inference for LLMs on modern GPU hardware, without requiring kernel rewrites or incurring prohibitive performance penalties. It enforces reproducibility by integrating speculative token decoding with a lightweight “verify–rollback” mechanism, maintaining peak throughput for non-deterministic requests and imposing overhead only in proportion to deterministic workload demand (Gond et al., 25 Jan 2026).
1. Motivation for Determinism in LLM Inference
Modern GPU-accelerated LLM inference is fundamentally non-deterministic at the bit level, even when launched with identical models, prompts, and sampling hyper-parameters. This non-determinism is primarily attributed to three system-level factors:
- Floating-point non-associativity: IEEE 754 arithmetic does not guarantee associativity of addition and multiplication: in general, $(a + b) + c \neq a + (b + c)$ in floating point. As a result, different summation or reduction orders yield minute numerical discrepancies.
- Dynamic batching: GPU inference engines maximize utilization via dynamic batching, wherein queries are coalesced on-the-fly, causing batch sizes—and thus kernel execution paths—to vary between runs.
- Variable reduction order in GPU kernels: High-performance GPU kernels (e.g., BLAS GEMM, attention, normalization) select reduction trees based on input shapes, varying “split-K,” tiling, or specialization strategies. These pipeline-dependent reduction orders further amplify divergence between otherwise identical invocations.
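The first of these factors is easy to see directly; a minimal Python demonstration of floating-point non-associativity:

```python
# IEEE 754 addition is not associative: regrouping changes the result bits.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)   # False
print(left, right)     # 0.6000000000000001 0.6
```

The discrepancy is only in the low-order bits, but once it crosses a token-sampling threshold, two runs of the same prompt diverge visibly.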
Disabling dynamic batching or enforcing a single “batch-invariant” reduction order can restore determinism, but at the cost of sharply reduced throughput (30–60% lower on GEMMs, 50% lower on RMSNorm operators) and significant engineering overhead to maintain distinct kernel stacks. These approaches also inflict fixed runtime penalties regardless of the actual frequency of deterministic requests, motivating the need for a more targeted, flexible solution.
2. System Overview and Key Observations
LLM-42 is predicated on two empirical insights:
- Sequence consistency: If two executions produce bit-identical outputs up to step $i$ (i.e., bit-exact token sequences and KV-cache states), the token at step $i+1$ will, with high probability, also match, because floating-point drifts rarely cross token-sampling thresholds.
- Shape-consistent reductions: Most GPU kernels, once input shapes (batch size, sequence length) are fixed, apply the chosen reduction order uniformly across the batch. Thus, within a verification window of fixed shape, kernel execution is deterministic.
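The second insight can be illustrated with a toy reduction in Python; the chunked schedule below is a simplified stand-in for a kernel's split-K strategy, not LLM-42's actual kernels:

```python
import random

# Simulate a kernel's reduction schedule: split a long sum into chunks,
# reduce each chunk, then combine the partial sums (akin to split-K in GEMM).
def reduce_with_split(xs, chunk):
    partials = [sum(xs[i:i + chunk]) for i in range(0, len(xs), chunk)]
    return sum(partials)

random.seed(0)
xs = [random.uniform(-1, 1) for _ in range(1000)]

# Fixed shape -> fixed schedule -> bit-identical results across runs.
print(reduce_with_split(xs, 64) == reduce_with_split(xs, 64))   # True

# A different schedule can flip the low-order bits of the result.
print(reduce_with_split(xs, 64) == reduce_with_split(xs, 128))  # typically False
```

Holding the input shape fixed pins the schedule, which is exactly what the verification batch exploits.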
System structure: LLM-42 speculatively decodes tokens on a fast path under dynamic batching. After every window of $k$ tokens, it invokes a verifier that replays those tokens under a fixed batch shape (the “verification batch”). If all speculative tokens verify bit-exact, they are committed; on a mismatch, the system rolls back to the divergence point and resumes decoding from there.
The architecture thus lets determinism be enforced selectively, amortizes the cost of verification over long consistent regions, and preserves full throughput where determinism is not required.
3. Decode–Verify–Rollback Protocol
The core of LLM-42 is the Decode–Verify–Rollback (DVR) protocol, parameterized by a window size $k$ and maintaining explicit tracking of the KV-cache and token sequence.
Formal Definitions
Let:
- $p$: user prompt.
- $t_0$: start-of-sequence token.
- $C_i$: KV-cache state after $i$ committed tokens.
- $\sigma(B, L)$: reduction schedule chosen by the GPU kernel for batch size $B$ and sequence length $L$.
- $D$: fast-path decoding function under dynamic batching, returning a candidate token $\hat{t}_{i+1}$ and new state $\hat{C}_{i+1}$.
- $V_k$: verifier that recomputes the next $k$ tokens with a fixed batch shape.
DVR Protocol
- Initialization: $C_0$ from deterministic prefill on $p$; $i \leftarrow 0$.
- Speculate: Decode $k$ tokens under dynamic batching: $(\hat{t}_{i+j}, \hat{C}_{i+j}) = D(\hat{t}_{i+j-1}, \hat{C}_{i+j-1})$ for $j = 1, \dots, k$.
- Verify: Replay the window under fixed shape: $(t_{i+1}, \dots, t_{i+k}) = V_k(\hat{t}_{i+1}, \dots, \hat{t}_{i+k}; C_i)$, where the verifier also produces cache states $C_{i+1}, \dots, C_{i+k}$.
- Mismatch detection: Find the first $m$ with $\hat{t}_{i+m} \neq t_{i+m}$ (if none, $m = k$).
- Commit or rollback: Emit $t_{i+1}, \dots, t_{i+m}$; update the fast-path cache with $C_{i+m}$; set $i \leftarrow i + m$. Repeat until EOS.
Correctness: The verifier is fully deterministic by construction (fixed batch shape), and only tokens matching its output are committed. All others are rejected and recomputed, ensuring reproducibility.
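The protocol can be sketched end-to-end with a toy deterministic model; `next_token`, `fast_decode`, and `verify_window` below are hypothetical stand-ins, not the paper's implementation:

```python
import random

# Toy sketch of the Decode-Verify-Rollback (DVR) loop.
K = 4  # verification window size

def next_token(state):
    # Canonical, deterministic next-token rule (the fixed-shape reference).
    return (sum(state) * 31 + 7) % 100

def fast_decode(prefix, k, rng):
    # Fast path: same rule, but with a small chance of a one-token divergence,
    # modeling a floating-point drift that crosses a sampling threshold.
    out, state = [], list(prefix)
    for _ in range(k):
        t = next_token(state)
        if rng.random() < 0.05:
            t = (t + 1) % 100
        out.append(t)
        state.append(t)
    return out

def verify_window(prefix, spec):
    # Verifier: teacher-forced replay of the speculative tokens at fixed shape.
    out, state = [], list(prefix)
    for t_hat in spec:
        out.append(next_token(state))
        state.append(t_hat)
    return out

def dvr_generate(prompt, n_tokens, seed=0):
    rng = random.Random(seed)
    committed, rollbacks = list(prompt), 0
    while len(committed) - len(prompt) < n_tokens:
        spec = fast_decode(committed, K, rng)                     # speculate
        ref = verify_window(committed, spec)                      # verify
        m = next((j for j in range(K) if spec[j] != ref[j]), K)   # 1st mismatch
        if m < K:
            rollbacks += 1                                        # rollback
        committed.extend(ref[:min(m + 1, K)])  # commit verified (+corrected)
    return committed[len(prompt):len(prompt) + n_tokens], rollbacks

# Two runs whose fast paths drift differently still agree bit-exactly.
out1, _ = dvr_generate([1, 2, 3], 32, seed=0)
out2, _ = dvr_generate([1, 2, 3], 32, seed=1)
print(out1 == out2)  # True
```

Because every committed token comes from the deterministic verifier, the output is independent of how (or whether) the fast path drifts; the drift only costs rollbacks.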
4. GPU Kernel Handling and Overhead Analysis
LLM-42 preserves all high-performance GPU kernels (e.g., cuBLAS GEMM, FlashAttention, fused RMSNorm) unchanged. The fast path is fully dynamic and batch-optimized, while the verifier simply reruns the same computation at a fixed shape by padding as needed. This eliminates the need for dual kernel stacks or global slowdowns.
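A minimal sketch of the padding step, assuming simple right-padding of a variable-size batch (the shapes and pad value are illustrative choices, not from the paper):

```python
# Right-pad a variable-size batch to a fixed (batch, length) shape so the
# verifier's kernels always see identical shapes, and hence identical
# reduction schedules.
FIXED_BATCH, FIXED_LEN, PAD = 8, 16, 0

def pad_to_fixed(batch):
    rows = [seq + [PAD] * (FIXED_LEN - len(seq)) for seq in batch]
    rows += [[PAD] * FIXED_LEN for _ in range(FIXED_BATCH - len(rows))]
    return rows

padded = pad_to_fixed([[5, 9, 3], [7, 1], [2, 2, 2, 2]])
print(len(padded), len(padded[0]))  # 8 16
```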
Let:
- $T_f$: time for fast-path decoding (per token).
- $T_v$: time per verification window.
- $d$: fraction of tokens requiring verification.
- $p_m$: probability that a verification window yields a mismatch.
Total expected inference time per token:

$$T = T_f + d\left(\frac{T_v}{k} + p_m\,T_r\right),$$

with $k$ the window size and $T_r$ the cost of recomputing rolled-back tokens. Empirically, $p_m \ll 1$, so recompute overhead is negligible in practice. As $d \to 0$, the system approaches the throughput of the fully non-deterministic fast path.
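A quick numeric plug-in, assuming the per-token cost decomposes as T = T_f + d·(T_v/k + p_m·T_r); all constants below are illustrative, not measurements from the paper:

```python
# Illustrative evaluation of the assumed per-token cost model.
T_f = 1.00   # fast-path time per token (arbitrary units)
T_v = 4.00   # time per verification window
T_r = 2.00   # cost of recomputing rolled-back tokens
k, p_m = 8, 0.01

for d in (0.0, 0.1, 0.5, 1.0):  # fraction of tokens requiring verification
    T = T_f + d * (T_v / k + p_m * T_r)
    print(f"d={d:.1f}: T={T:.3f}")  # d=0.0 recovers the fast path exactly
```

The overhead term is linear in $d$, which is the mechanism behind the "pay only for deterministic traffic" behavior observed below.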
5. Experimental Results
Experiments are reported on Llama-3.1-8B-Instruct evaluated on H100 PCIe GPUs, using the SGLang serving engine. Three configurations are compared:
- SGLang-Non-Deterministic: optimized kernels, no determinism.
- SGLang-Deterministic: all kernels replaced with batch-invariant versions (60% slower GEMM, 50% slower RMSNorm).
- LLM-42: non-deterministic SGLang with DVR scheduler and grouped verification.
Offline throughput (tokens/sec, batch size 8, mean output 192 tokens):
| Deterministic fraction | SGLang-Non-Deterministic | SGLang-Deterministic | LLM-42 |
|---|---|---|---|
| 0% | 931 | – | 932 |
| 10% | – | – | 914 |
| 20% | – | – | 903 |
| 50% | – | – | 869 |
| 100% | – | 415 | 421 |
LLM-42 achieves up to a 2.2× throughput improvement over SGLang-Deterministic when only a small fraction (<10%) of traffic demands determinism, with minimal throughput reduction (2% at 10% deterministic).
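The headline ratios follow directly from the table:

```python
# Sanity-check of the headline numbers against the offline-throughput table.
speedup = 914 / 415        # LLM-42 at 10% deterministic vs SGLang-Deterministic
reduction = 1 - 914 / 931  # loss vs the non-deterministic baseline at 10%
print(round(speedup, 1), f"{reduction:.1%}")  # 2.2 1.8%
```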
Online end-to-end latency (P50 / P90 at 12 QPS):
| Mode | P50 (s) | P90 (s) |
|---|---|---|
| SGLang-Non-Deterministic | 2.15 | 5.03 |
| SGLang-Deterministic | 4.64 | 13.2 |
| LLM-42 @ 2% deterministic | 2.21 | 5.25 |
| LLM-42 @ 10% deterministic | 2.79 | 5.40 |
| LLM-42 @ 50% deterministic | 3.85 | 6.81 |
| LLM-42 @ 100% deterministic | 4.54 | 7.39 |
Rollback frequency remains low: with 100% deterministic traffic and 4,096 requests, only 96 rollbacks are observed (∼0.02 per request), and just 0.32% of tokens require recomputation. These results confirm that LLM-42's cost scales with the deterministic share of the workload.
6. Comparison with Prior Approaches
Conventional methods for enforcing determinism rely on disabling dynamic batching or deploying only batch-invariant kernels. These induce global slowdowns (30–60% for GEMMs, 50% for RMSNorm), require maintaining two kernel codebases, and penalize every query, including those not requiring determinism.
In contrast, LLM-42:
- Operates entirely atop the existing high-performance kernel stack.
- Preserves dynamic batching and hardware–software co-optimization on the fast path.
- Restricts overhead to only those segments of traffic requiring deterministic outputs.
- Requires no new kernel development, only a scheduler, a fixed-shape verifier, and cache patching for rollbacks.
This division of labor enables production LLM serving engines to balance peak throughput, low latency, and reproducibility without systemic compromises.
LLM-42 thus provides a principled and efficient methodology for bit-exact reproducibility in GPU-based LLM inference by speculative decoding and selective verification, aligned with modern system-level optimization strategies (Gond et al., 25 Jan 2026).