
Token-DiFR: LLM Inference Verification & Compression

Updated 2 December 2025
  • Token-DiFR is a framework for quantifying token-level divergence to verify LLM inference correctness and assess model compression fidelity.
  • It leverages seed-controlled Gumbel-Max sampling to synchronize randomness, ensuring that any deviation in token selection signals potential model misconfiguration.
  • The metric provides a normalized, sensitive measure for detecting subtle discrepancies in generation, aiding in rigorous auditing and evaluation of model modifications.

Token-DiFR (Token-Divergence-From-Reference) is a framework for verifying the correctness of sequence generation in LLM inference, particularly in scenarios with inherent nondeterminism due to floating-point noise and stochastic sampling. Token-DiFR is also defined as an analytic metric for quantifying the stability and fidelity of compressed or modified LLMs by comparing token-level generation divergence relative to a reference model. The term is prominent in two distinct but related technical lineages: as a cryptographically lightweight verification protocol for inference correctness (Karvonen et al., 25 Nov 2025), and as a normalized metric for quantifying generation similarity under model compression (Deiseroth et al., 2023). Both usages draw on the foundational idea of aligning and evaluating tokens or distributions at each generation step, but operate under distinct methodological regimes.

1. Inference Verification: Motivation and Core Principle

LLM inference is notoriously nondeterministic across runs, due to sensitivity to GPU kernel variations, floating-point accumulations, and batching/scheduling strategies. As such, bitwise comparison or naive re-execution of the inference pipeline is brittle: outputs may legitimately differ even between correct implementations. This complicates third-party auditing, model attribution, and security in outsourced inference scenarios. Existing alternatives, such as cryptographic zero-knowledge proofs, remain computationally impractical for real-time or high-throughput LLM workloads.

Token-DiFR provides a rigorous verification scheme that exploits deterministic control over the randomness injected into the sampling process via explicit synchronization of the PRNG seed. With provider and verifier anchored to a common sampling seed, the only valid source of variation is benign floating-point noise in the logits. This reframes each output token as direct evidence of whether the provider followed the correct procedure: any systematic or large deviation signals possible misconfiguration, tampering, or subtle implementation flaws (Karvonen et al., 25 Nov 2025).

2. Sampling-Seed Synchronization and the Gumbel-Max Trick

The technical foundation for Token-DiFR-based verification is seed-controlled Gumbel-Max sampling. Assume the LLM parameterizes a distribution over vocabulary tokens via a logit vector $l \in \mathbb{R}^V$, with temperature $T > 0$. At each generation step, Gumbel noise $g \sim \mathrm{Gumbel}(0,1)^V$ is added to $l$, and the sampled token is selected by:

$t = \arg\max_{i}\,[\,l_i + T \cdot g_i\,]$

Fixing the PRNG seed $\sigma$ ensures that both verifier and provider can reconstruct the same $g$. In this configuration, any deviation in the token sequence generated by the provider (conditional on the same prompt, model weights, hyperparameters, and seed) must arise from non-benign differences in model implementation, weight corruption, quantization, or illicit post-processing (Karvonen et al., 25 Nov 2025).
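A minimal sketch of this sampling rule, assuming NumPy; the per-step derivation of the noise stream from $(\sigma, \text{step})$ is an illustrative convention, not necessarily the scheme used in the paper:

```python
import numpy as np

def gumbel_max_sample(logits: np.ndarray, temperature: float, seed: int, step: int) -> int:
    """Sample one token via the Gumbel-Max trick under a shared PRNG seed.

    Provider and verifier derive identical Gumbel noise from (seed, step),
    so the sampled token is a deterministic function of the logits.
    """
    rng = np.random.default_rng([seed, step])              # per-step stream from the shared seed
    g = rng.gumbel(loc=0.0, scale=1.0, size=logits.shape)  # g ~ Gumbel(0, 1)^V
    return int(np.argmax(logits + temperature * g))        # t = argmax_i [l_i + T * g_i]
```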

Without seed synchronization, only distributional statistics can be assessed; such tests require many samples, suffer from high variance, and remain vulnerable to adversarial “stat-matching” or other stochastic manipulation.

3. Formal Metric: Token-DiFR Margin

Token-DiFR operationalizes the “distance from reference” at the token level via the Token-DiFR margin. Let $t^*$ be the provider’s generated token at position $j$, $f$ the token the verifier would have sampled using the same $g$, and $z_i = l_i + T g_i$ the perturbed scores. The token-level margin is:

$\Delta_{\mathrm{TokenDiFR}}(t^*, f; g) = \min\left\{\,[l_f + T g_f] - [l_{t^*} + T g_{t^*}],\ \Delta_{\max}\right\}$

where $\Delta_{\max}$ is a ceiling that mitigates extreme outliers. $\Delta = 0$ if and only if $t^* = f$. Margins grow as the provider’s selected token becomes less probable (under shared noise) relative to the reference pick. If $t^*$ is pruned by top-$k$ or top-$p$ at verification, the margin is treated as $+\infty$ before capping (Karvonen et al., 25 Nov 2025).
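A minimal sketch of the margin computation, again assuming NumPy; the cap value `delta_max` and the optional `allowed` set (tokens surviving top-$k$/top-$p$ at verification) are illustrative parameters, not values prescribed by the paper:

```python
import numpy as np

def token_difr_margin(logits, temperature, gumbel, provider_token,
                      delta_max=20.0, allowed=None):
    """Per-token Token-DiFR margin: gap between the verifier's pick and the
    provider's token under shared Gumbel noise. Equals 0.0 iff the tokens agree."""
    z = logits + temperature * gumbel                    # perturbed scores z_i = l_i + T * g_i
    if allowed is not None and provider_token not in allowed:
        return float(delta_max)                          # pruned by top-k/top-p: +inf, then capped
    reference_token = int(np.argmax(z))                  # the token f the verifier would sample
    margin = z[reference_token] - z[provider_token]      # >= 0 by construction of the argmax
    return float(min(margin, delta_max))
```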

4. Verification Workflows

The canonical Token-DiFR verification process (Karvonen et al., 25 Nov 2025):

  1. Synchronization: Provider and client agree on model, tokenizer, hyperparameters (including softmax temperature, top-$k$, top-$p$), and seed $\sigma$.
  2. Generation: Provider generates tokens $\{y_1, \ldots, y_n\}$, recording $\sigma$.
  3. Replay: Verifier reconstructs all logits for the prompt and generated tokens, and re-derives the Gumbel noise $g$ from $\sigma$.
  4. Scoring:
    • For each position $j$, compute the provider margin $\Delta_j$ as above.
    • Optionally compute negative log-likelihoods and strict matches.
  5. Aggregation: Pool the per-token margins (e.g., by mean or a robust percentile).
  6. Decision: Compare the aggregate score to a threshold $\tau$ calibrated on “benign” (honest) runs; flag if exceeded.

In practice, over 98% of tokens match exactly under honest operation. Subtle or systematic misbehavior is exposed by aggregating the modest divergences across many tokens.
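A sketch of the replay-and-scoring loop, reusing `token_difr_margin` and the $(\sigma, \text{step})$ noise convention from the earlier sketches; `forward_logits` is a placeholder for a reference forward pass, and the mean pooling and threshold decision correspond to steps 4–6 above:

```python
import numpy as np

def verify_transcript(prompt_ids, generated_ids, forward_logits, seed,
                      temperature, threshold, delta_max=20.0):
    """Replay a provider transcript and decide accept/flag from pooled margins.

    `forward_logits(token_ids)` is assumed to return the reference model's
    next-token logits; `threshold` is calibrated on known-honest runs.
    """
    margins = []
    context = list(prompt_ids)
    for step, provider_token in enumerate(generated_ids):
        logits = forward_logits(context)                     # verifier's recomputed logits
        rng = np.random.default_rng([seed, step])            # same noise stream as the provider
        gumbel = rng.gumbel(size=logits.shape)
        margins.append(token_difr_margin(logits, temperature, gumbel,
                                         provider_token, delta_max))
        context.append(provider_token)                       # teacher-force the provider's token
    margins = np.array(margins)
    score = float(margins.mean())                            # or a robust percentile
    return {"score": score,
            "exact_match_rate": float((margins == 0.0).mean()),
            "flagged": score > threshold}
```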

5. Empirical Sensitivity and Robustness

Token-DiFR offers high sample efficiency and sensitivity for several key classes of violations:

  • Quantization: For Llama-3.1-8B, 4-bit quantization is distinguished from bf16 at AUC $> 0.999$ within 300 tokens (AUC = 0.99 at 100 tokens) (Karvonen et al., 25 Nov 2025).
  • Sampling bugs: Rare but catastrophic simulated sampling errors (e.g., uniform sampling among the top-$k$ candidates 1% of the time) are detected with 99%+ AUC within a few thousand tokens.
  • Adversarial robustness: Unlike cross-entropy detectors, adversarial tuning (e.g., manipulating temperature to match NLL) cannot consistently circumvent Token-DiFR when the sampling seed is public, because the actual sampled sequence under shared noise is hard to fake.

Pooling statistics across tokens increases reliability and reduces vulnerability to adversarial evasion.

6. Assumptions, Limitations, and Extensions

Token-DiFR assumes:

  • Access to open weights, tokenizer, and a reference implementation; it is not applicable to closed-source model endpoints.
  • Provider exposure of per-request seed control; this is straightforward in frameworks such as vLLM (see the brief example after this list) but may require API changes elsewhere.
  • Calibration of the decision threshold on benign runs, since hardware-induced floating-point variance makes optimal thresholds data- and hardware-specific.
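Regarding per-request seed control, a brief illustration, assuming a vLLM version whose `SamplingParams` exposes a per-request `seed`; this shows seed control only, not the full Gumbel-Max synchronization protocol:

```python
# Illustrative only: check the installed vLLM version's API before relying on this.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")   # example model ID
params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=128, seed=1234)
outputs = llm.generate(["Explain Token-DiFR in one sentence."], params)
print(outputs[0].outputs[0].text)                     # per-request seed makes the sampling noise reproducible
```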

A notable extension is Activation-DiFR, which enables compact verification of activations themselves via random orthogonal projections, allowing sample-efficient auditing (e.g., detecting 4-bit quantization at AUC $> 0.999$ in as few as 2 tokens) while reducing communication overhead by 25–75% relative to legacy fingerprinting (Karvonen et al., 25 Nov 2025).
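A schematic illustration of the random-projection idea behind Activation-DiFR; the QR-based projection and the distance comparison below are assumptions made for illustration, not the paper's exact construction:

```python
import numpy as np

def random_orthogonal_projection(dim: int, k: int, seed: int) -> np.ndarray:
    """Seeded k-dimensional orthogonal projection (QR of a Gaussian matrix)."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((dim, k)))   # (dim, k), orthonormal columns
    return q

def activation_fingerprint(activations: np.ndarray, proj: np.ndarray) -> np.ndarray:
    """Compress per-token hidden activations (n_tokens, dim) into (n_tokens, k)."""
    return activations @ proj

# The verifier rebuilds the same projection from the shared seed and compares, e.g.,
# np.linalg.norm(provider_fp - verifier_fp, axis=-1) against a calibrated tolerance.
```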

When the temperature $T = 0$ (greedy decoding), Token-DiFR reduces to simple per-token argmax checks, which offer immediate spot auditing in practical open-source deployments, though they are susceptible to selective behavior by the provider.

7. Token-DiFR as a Metric for Compression Robustness

In the context of model compression, quantization, and pruning, Token-DiFR is defined as the Token Divergence Fraction Retained:

$\mathrm{Token\mbox{-}DiFR}_\tau(M') = \frac{\mathrm{FDTM}_\tau(M, M')}{\mathrm{FDTM}_\tau(M, M)}$

where $\mathrm{FDTM}_\tau$ is the First Divergent Token Metric at divergence threshold $\tau$ between an original model $M$ and a compressed variant $M'$ (Deiseroth et al., 2023).

  • $\mathrm{Token\mbox{-}DiFR}_\tau \approx 1$ indicates that the compressed model remains indistinguishable from the reference for as many tokens as the reference does against itself—i.e., minimal degradation.
  • Values near zero indicate almost immediate divergence.

Token-DiFR provides an interpretable, unitless measure: if $\mathrm{Token\mbox{-}DiFR}\geq 0.95$, compression preserves nearly all generation fidelity at the token trajectory level. It is strictly more sensitive than perplexity or strict token-level accuracy, because it captures where and how compressed models shift their entire predictive distribution, with direct relevance to global generation quality and downstream utility.

To compute Token-DiFR in practice:

  1. Fix a divergence threshold $\tau$ (e.g., $\tau = 0.75$ total variation).
  2. Compute $\mathrm{FDTM}_\tau(M, M')$ for the compressed model and $\mathrm{FDTM}_\tau(M, M)$ for the reference against itself.
  3. Divide to yield the fraction.
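A schematic sketch of this computation; the total-variation test for “divergence” and the handling of non-diverging sequences are illustrative assumptions, and the authoritative FDTM definition is given in Deiseroth et al. (2023):

```python
import numpy as np

def first_divergent_token(ref_dists, cmp_dists, tau: float) -> int:
    """Index of the first step at which the total-variation distance between the two
    models' next-token distributions exceeds tau; returns the full length if none does."""
    for step, (p, q) in enumerate(zip(ref_dists, cmp_dists)):
        if 0.5 * np.abs(p - q).sum() > tau:              # total-variation distance
            return step
    return len(ref_dists)

def token_difr(fdtm_ref_vs_compressed: float, fdtm_ref_vs_ref: float) -> float:
    """Token-DiFR_tau = FDTM_tau(M, M') / FDTM_tau(M, M)."""
    return fdtm_ref_vs_compressed / fdtm_ref_vs_ref
```

In practice, `first_divergent_token` would be aggregated over many prompts to obtain each $\mathrm{FDTM}_\tau$ term before taking the ratio.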

For further technical details and comparative data, see the originating Divergent Token Metrics paper (Deiseroth et al., 2023).


Token-DiFR thus designates both a verification protocol for high-assurance LLM inference under stochasticity, and a rigorous metric for quantifying and managing fidelity losses in model compression workflows. By anchoring verification and measurement at the token level, it enables highly efficient, precise, and practical auditing of both inference procedures and model modifications across the LLM landscape.
