Neither Parallel Nor Sequential: How DiffusionGemma Actually Commits Tokens

Published 12 Jun 2026 in cs.LG | (2606.14620v1)

Abstract: Open diffusion LLMs are marketed as parallel, non-autoregressive decoders, yet the order in which a shipped checkpoint actually commits its tokens is almost never measured. We instrument DiffusionGemma 26B, a masked discrete-diffusion mixture-of-experts model built on Gemma 4, hooking its sampler's accept step to record which canvas positions commit, when, and at what confidence. Across a 686-prompt, six-regime probe suite we find that its decoding is neither parallel nor block-autoregressive: it follows a partial left-to-right commit bias whose apparent strength depends almost entirely on the granularity at which you look. Order is weak token by token and strengthens smoothly as the analysis is coarsened, so the model's "block size" turns out to be an artifact of the measuring ruler rather than the architecture. The model commits in large simultaneous batches, leaving much of the within-batch order genuinely undefined rather than merely unobserved. The behaviour is regime-dependent: structured JSON is committed in essentially arbitrary order, and a position's commit confidence tracks correctness on mathematical reasoning but carries no signal on factual recall. Commitment is aggressive, finishing in a short late burst well inside the step budget, while task accuracy matches the model's autoregressive Gemma-4 sibling. Beyond these findings, our central contribution is methodological: measuring decoding order honestly demands handling trailing-EOS padding, within-regime confounding, commit non-monotonicity, block-size sensitivity, and large commit-batch ties, each of which can otherwise manufacture a decoding-order result that is not really there.

Authors (3)

Summary

The paper presents a rigorous empirical analysis showing that DiffusionGemma finalizes token commitments in a partially ordered, granularity-dependent manner.
It employs precise instrumentation to log token acceptances, analyze entropy confidence, and compare commit orders across diverse generation regimes.
Findings reveal that despite non-sequential, large batch commitments, DiffusionGemma achieves accuracy comparable to autoregressive models.

Empirical Analysis of Token Commitment in DiffusionGemma

Overview

This study presents a rigorous empirical investigation into the actual token commitment order of DiffusionGemma, a masked discrete-diffusion mixture-of-experts LLM based on Gemma 4. While diffusion LLMs (DLMs) are frequently described as parallel decoders, the concrete sequence and granularity with which tokens are finalized during sampling have not been systematically measured in shipped checkpoints. Using inference-time instrumentation, the authors quantify the commit process, its dependence on generation regime, and the reliability of entropy-bound commit confidence, and they compare task accuracy against an autoregressive (AR) model.

Methodology

Instrumentation is applied directly to the EntropyBoundSampler of DiffusionGemma (25.2B/3.8B parameters, public google/ checkpoint) and logs every accept call: which token positions are finalized at each accept, and the associated entropy. The analysis covers 686 diverse prompts across six regimes: math (GSM8K), code completion and synthesis (HumanEval, MBPP), factual recall, open-ended instructions, and structured JSON. The main metrics are the tie-aware Kendall $\tau_b$ correlation between commit index and left-to-right position (at both token and block granularity), commit batch sizes, within-batch tie statistics, and the AUROC of commit entropy as a predictor of correctness. Bootstrapped confidence intervals are obtained from a five-seed strengthening run, with controls for EOS-padding artifacts and regime-specific confounds.
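
The bootstrapped intervals can be illustrated with a percentile bootstrap that resamples whole prompts. This is a minimal sketch under stated assumptions, not the authors' procedure: the function name `cluster_bootstrap_ci`, the choice to average the seeds of a prompt before resampling, and the toy numbers are all hypothetical.

```python
import random
from statistics import mean


def cluster_bootstrap_ci(values_by_prompt, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for a mean, resampling prompts as clusters.

    values_by_prompt maps a prompt id to the values measured for that
    prompt, one per seed (for example one token-level tau_b per seed).
    Returns (point_estimate, lower, upper).
    """
    rng = random.Random(seed)
    prompts = list(values_by_prompt)
    # Assumption: collapse seeds within a prompt so each prompt counts once.
    prompt_means = {p: mean(values_by_prompt[p]) for p in prompts}
    stats = []
    for _ in range(n_boot):
        # Draw as many prompts as we have, with replacement.
        sample = [prompt_means[rng.choice(prompts)] for _ in prompts]
        stats.append(mean(sample))
    stats.sort()
    lower = stats[round((alpha / 2) * (n_boot - 1))]
    upper = stats[round((1 - alpha / 2) * (n_boot - 1))]
    return mean(prompt_means.values()), lower, upper


# Made-up per-prompt values (two seeds each), for illustration only.
toy = {"p1": [0.4, 0.5], "p2": [0.6, 0.6], "p3": [0.5, 0.7], "p4": [0.3, 0.5]}
point, lower, upper = cluster_bootstrap_ci(toy)
```

Resampling prompts rather than individual runs keeps repeated seeds of the same prompt from being treated as independent observations, which would make the interval too narrow.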

Findings: Decoding Order Properties

A primary finding is that DiffusionGemma exhibits a partial, granularity-dependent left-to-right commit bias that deviates substantially from both pure left-to-right (autoregressive) and strict block-autoregressive behavior. Token-level $\tau_b$ values are moderate ($0.43$ to $0.60$ in prose, code, math, and factual regimes), and a block-size sweep shows that block-level $\tau_b$ increases smoothly with bin size, with no special status for 16-token blocks. A strict block-sequential control yields $\tau_b \approx 0.94$ to $0.96$, well above the measured values, which points to genuine sub-block disorder in the model's behavior.
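
The token-level and block-level measurements can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: `block_tau_b` represents each bin by its mean commit step, and `block_sequential_control` commits one token per step with a random order inside each block. Both choices are hypothetical, since the summary does not specify how the paper defines them.

```python
import random
from math import sqrt


def kendall_tau_b(x, y):
    """Tie-aware Kendall tau-b via an O(n^2) scan over all pairs."""
    n = len(x)
    concordant = discordant = tied_x = tied_y = 0
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            tied_x += dx == 0
            tied_y += dy == 0
            if dx and dy:
                if (dx > 0) == (dy > 0):
                    concordant += 1
                else:
                    discordant += 1
    n_pairs = n * (n - 1) // 2
    denom = sqrt((n_pairs - tied_x) * (n_pairs - tied_y))
    return (concordant - discordant) / denom if denom else float("nan")


def block_tau_b(commit_step, bin_size):
    """tau-b after coarsening positions into bins of bin_size tokens.

    Assumption: a bin's commit time is the mean commit step of its tokens.
    """
    bins = [commit_step[i:i + bin_size]
            for i in range(0, len(commit_step), bin_size)]
    bin_commit = [sum(b) / len(b) for b in bins]
    return kendall_tau_b(list(range(len(bins))), bin_commit)


def block_sequential_control(n_tokens, block, rng):
    """Blocks commit strictly left to right; order inside a block is random."""
    commit = [0] * n_tokens
    step = 0
    for start in range(0, n_tokens, block):
        order = list(range(start, min(start + block, n_tokens)))
        rng.shuffle(order)
        for pos in order:
            commit[pos] = step
            step += 1
    return commit


control = block_sequential_control(256, 16, random.Random(0))
token_tau = kendall_tau_b(list(range(256)), control)
sweep = {b: block_tau_b(control, b) for b in (2, 4, 8, 16, 32)}
```

In this toy control, 1920 of the 32640 token pairs fall inside a block, so the expected token-level value is 30720 / 32640, about 0.94, while the bin-16 value is exactly 1. A real log is scored the same way by passing its per-position commit steps in place of `control`.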

Commit batches are large (roughly 13 to 26 content tokens per accept call), so a high fraction of token pairs are tied (up to $0.72$ for JSON). Commit order must therefore be interpreted at the batch level rather than at single-token granularity.
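
To see why large batches leave much of the order undefined, the tied-pair fraction can be computed directly from batch sizes. This is a minimal sketch: the helper name `tied_pair_fraction` is hypothetical, and it assumes that two tokens are tied exactly when they are committed in the same accept call.

```python
from math import comb


def tied_pair_fraction(batch_sizes):
    """Fraction of token pairs that share a commit index.

    batch_sizes lists how many content tokens each accept call committed.
    Pairs inside one batch are tied; pairs across batches are ordered.
    """
    total_pairs = comb(sum(batch_sizes), 2)
    tied_pairs = sum(comb(size, 2) for size in batch_sizes)
    return tied_pairs / total_pairs if total_pairs else float("nan")


# Hypothetical reply of 60 content tokens committed in three equal batches.
fraction = tied_pair_fraction([20, 20, 20])  # 570 / 1770, about 0.322
```

The fraction grows quickly as batches become fewer and more uneven, which is why a tie-aware statistic such as $\tau_b$ is needed rather than one that silently breaks ties by position.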

Figure 1: Decoding order in the strengthening run. Left: Moderate left-to-right bias in prose/code/math/factual, well below sequential decoding; JSON regime is order-independent. Right: block-level $\tau_b$ increases smoothly with bin size $B$, without a privileged value at $16$.

In constrained JSON generation, the commit process is approximately order-independent at the token level, further supporting regime dependence. The findings contradict the notion of DiffusionGemma as block-autoregressive with a specific architectural block size. They instead identify a partially ordered, highly batched commit process whose sequentiality is both incomplete and variable across output constraints.

Regime-Specific Confidence Calibration

The regime dependence also shows up in the reliability of commit confidence. On math (GSM8K), negative entropy-at-commit is a robust predictor of correctness, with accuracy declining monotonically across increasing entropy tertiles. On factual recall, however, entropy fails as a confidence signal (the AUROC confidence interval includes chance, and the reliability curve is flat), which warns against regime-blind pooling. For structured JSON, accuracy is perfect (no errors), so AUROC is undefined. Pooling across regimes induces a Simpson's paradox effect, reversing genuine within-regime trends.
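
The pooling hazard can be reproduced on a small scale. The sketch below is illustrative only: the `auroc` helper is a plain pairwise implementation, and the entropy values and regime names are made up rather than taken from the paper.

```python
def auroc(scores, labels):
    """P(correct item outscores incorrect item); ties count as one half."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        return float("nan")  # undefined when one class is absent
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def regime_auroc(items):
    """items: (entropy_at_commit, is_correct) pairs.

    Confidence is negative entropy, so lower entropy ranks higher.
    """
    return auroc([-h for h, _ in items], [c for _, c in items])


# Made-up data: regime_a is low-entropy but mostly wrong,
# regime_b is high-entropy but mostly right.
toy = {
    "regime_a": [(0.01, 1), (0.02, 0), (0.03, 0), (0.04, 0)],
    "regime_b": [(0.10, 1), (0.11, 1), (0.12, 1), (0.13, 0)],
}

per_regime = {name: regime_auroc(items) for name, items in toy.items()}
pooled = regime_auroc([pair for items in toy.values() for pair in items])
# per_regime is 1.0 for both regimes, while pooled is 0.4375
```

Within each toy regime entropy separates correct from incorrect answers perfectly, yet the pooled AUROC falls below chance because the regimes differ in both baseline entropy and baseline accuracy. A regime with no errors returns NaN, matching the undefined JSON case.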

Figure 2: Reliability of entropy-based commit confidence by tertile. Math (GSM8K) exhibits monotonic discrimination; factual recall does not.

Entropy-bounded commitment also stops early: the model often finishes in far fewer than the allowed 48 steps (3.3 to 17.1 accept calls, depending on regime) and commits at entropies substantially below the 0.1 threshold. The commit process is thus fast in terms of steps used while remaining conservative about the confidence it requires.
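
A log of accept calls can be summarized against the step budget and entropy threshold as in the sketch below. Several things here are assumptions: the log layout (a list of accept calls, each a list of `(position, entropy)` pairs), the names `shannon_entropy` and `commit_summary`, the use of nats as the entropy unit, and the treatment of one accept call as one step. The values 48 and 0.1 are the ones quoted above.

```python
from math import log


def shannon_entropy(probs):
    """Shannon entropy in nats (assumed unit) of a discrete distribution."""
    return -sum(p * log(p) for p in probs if p > 0)


def commit_summary(accept_log, max_steps=48, entropy_bound=0.1):
    """Summarize how much of the budget a run used and how confident it was.

    accept_log: list of accept calls; each call is a list of
    (position, entropy_at_commit) pairs. Hypothetical layout.
    """
    entropies = [h for call in accept_log for _, h in call]
    return {
        "accept_calls": len(accept_log),
        "budget_used": len(accept_log) / max_steps,
        "max_commit_entropy": max(entropies, default=float("nan")),
        "all_within_bound": all(h <= entropy_bound for h in entropies),
    }


# Made-up run: five positions committed over three accept calls.
toy_log = [[(0, 0.01), (1, 0.02)], [(2, 0.005)], [(3, 0.03), (4, 0.0)]]
summary = commit_summary(toy_log)
```

For scale, a two-way distribution of 0.99 versus 0.01 already has an entropy of about 0.056 nats, so commits well below 0.1 correspond to near-deterministic predictions.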

Accuracy Versus Autoregressive Sibling

Across the scorable regimes (math, factual, JSON), DiffusionGemma's accuracy is comparable to that of Gemma-4 26B-A4B, its AR sibling, with minor differences that fall within statistical uncertainty, GSM8K included. This indicates that the observed commit-order properties do not substantially degrade task performance under practical sampling configurations.

Methodological Contributions

A notable contribution is the careful catalog of artifacts and pitfalls relevant to empirical commit-order measurement: EOS-pad effects, batch-sized commit resolution, non-monotone position commitments (“un-accept” events are common), bin-size dependency, and pooled-statistic pathologies. The authors advocate for content-only, regime-specific, and tie-aware order analyses, and provide protocols for robust prompt clustering, seeds, and reproducibility.
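
Two of these pitfalls, trailing EOS padding and non-monotone commits, can be handled before any order statistic is computed. The sketch below is one possible treatment, not the authors' protocol: it assumes that a position's commit index is the last accept call that touched it and that every trailing EOS token is padding. All function names and the toy token ids are hypothetical.

```python
def final_commit_steps(accept_log):
    """Map each position to the index of the LAST accept call that set it.

    accept_log: list of accept calls, each a list of positions. A position
    may appear more than once if it was un-accepted and committed again.
    """
    last = {}
    for step, positions in enumerate(accept_log):
        for pos in positions:
            last[pos] = step  # later accepts overwrite earlier ones
    return last


def content_length(token_ids, eos_id):
    """Canvas length after dropping the trailing run of EOS tokens."""
    n = len(token_ids)
    while n > 0 and token_ids[n - 1] == eos_id:
        n -= 1
    return n


def content_commit_order(accept_log, token_ids, eos_id):
    """(position, commit_step) pairs restricted to content positions."""
    last = final_commit_steps(accept_log)
    n = content_length(token_ids, eos_id)
    return [(pos, last[pos]) for pos in range(n) if pos in last]


# Made-up canvas: five content tokens followed by three EOS pads (id 2).
toy_tokens = [11, 12, 13, 14, 15, 2, 2, 2]
toy_log = [[0, 1, 5, 6, 7], [2, 1], [3, 4]]  # position 1 is re-committed
order = content_commit_order(toy_log, toy_tokens, eos_id=2)
# order == [(0, 0), (1, 1), (2, 1), (3, 2), (4, 2)]
```

Leaving the three padding positions in would add a run of right-edge tokens that all commit in the first call, which distorts the correlation between position and commit index without saying anything about how content is ordered.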

Implications and Future Directions

The heterogeneous, partially ordered commit dynamics uncovered here suggest that diffusion LMs, in realistic checkpoints and shipped configurations, do not realize the theoretical ideal of parallel, order-independent decoding. Instead, their behavior interpolates between AR and block-AR limits in a way that is regime- and granularity-sensitive but sufficient for competitive accuracy. This helps reconcile apparent contradictions in prior literature between left-to-right-biased and “anchor-first” generation findings, and it emphasizes that model and sampler jointly determine empirical behavior.

For users and practitioners, these results caution against assuming either strict parallelization (which would imply maximal throughput and minimal latency) or clean blockwise factorization. For model designers, the prevalence of “unresolved” within-batch order and non-monotonic commits highlights open areas for algorithmic refinement and for further study of commit confidence signals. The aggressive early stopping and modest use of the available denoising steps suggest potential inefficiencies or untapped capacity.

The instrumentation and analysis framework introduced establishes a template for forensic evaluation of other DLMs, contributing to the reproducibility and interpretability of non-AR generation pipelines.

Conclusion

By instrumenting DiffusionGemma’s commit mechanism directly, this work provides the first detailed empirical map of commit order and its regime dependence in a production diffusion LM. The results reveal a partial, granularity-dependent left-to-right bias, substantial sub-block disorder, aggressive (and sometimes regime-misaligned) confidence-based commitment, and competitive downstream accuracy. The origin and implications of large, unresolved commit batches and non-monotone commitment warrant continued theoretical and engineering investigation, especially as DLMs are increasingly deployed across tasks with varying structural constraints.
