Average Finalization Parallelism (AFP)
- AFP is a quantitative metric that measures the average number of tokens finalized per denoising step in Masked Diffusion Language Models; its value ranges from 1 (fully sequential) to the sequence length $n$ (fully parallel decoding).
- It provides a hardware-agnostic evaluation of generation efficiency by capturing the trade-off between parallel processing and conditional dependency accuracy.
- AFP is bounded between 1 (sequential) and $n$ (fully parallel); measured values for current MDLMs fall roughly between 1.5 and 4.0, guiding model design to balance decoding speed and output coherence.
Average Finalization Parallelism (AFP) is a quantitative metric that measures the degree of token-level parallelism realized by Masked Diffusion LLMs (MDLMs) during sequence generation. Introduced in the context of evaluating the practical efficiency and conditional-independence properties of MDLMs, AFP captures, for a given sequence length $n$, the average number of tokens finalized (i.e., irreversibly unmasked) per effective denoising step during parallel decoding. Its values range from 1 (fully sequential; equivalent to autoregressive decoding) to $n$ (fully parallel; all tokens finalized simultaneously), providing a hardware-agnostic assessment of intrinsic generation speedup and reflecting the model's capacity for non-sequential, any-order decoding (Zhong et al., 22 Jan 2026).
1. Formal Definition and Computation
Let $i \in \{1, \dots, n\}$ index positions in the generated sequence, and let $s_i$ be the index of the denoising step at which token $i$ is finalized. The set $S = \{s_1, \dots, s_n\}$ collects all distinct finalization steps, and the number of such steps is denoted $T = |S|$. The Average Finalization Parallelism is then defined as

$$\mathrm{AFP} = \frac{n}{T}.$$
By construction:
- Autoregressive (AR) decoders yield $\mathrm{AFP} = 1$, since every token occupies its own step ($T = n$).
- Fully parallel models yield $\mathrm{AFP} = n$, since all tokens share a single step ($T = 1$).
Step-wise computation:
- Run MDLM diffusion decoding, recording for each token $i$ the finalization step $s_i$.
- Collect $S = \{s_1, \dots, s_n\}$.
- Compute $T = |S|$ and $\mathrm{AFP} = n/T$.
- Aggregate this statistic across instances to obtain means and distributions.
AFP thus operationalizes “how many tokens, on average, the model finalizes in each denoising step,” serving as a direct measure of realized parallelism within the generative process (Zhong et al., 22 Jan 2026).
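The statistic is trivial to compute from a decoding trace. A minimal sketch in Python, assuming the decoder logs a per-token list `finalization_steps` (the function and variable names are illustrative, not from the paper):

```python
def average_finalization_parallelism(finalization_steps: list[int]) -> float:
    """AFP = n / T: sequence length n over the number T of distinct
    denoising steps at which at least one token was finalized."""
    n = len(finalization_steps)        # sequence length
    T = len(set(finalization_steps))   # distinct effective denoising steps
    return n / T

# Autoregressive decoding: each token finalized at its own step -> AFP = 1.
assert average_finalization_parallelism(list(range(8))) == 1.0

# Fully parallel decoding: all tokens finalized at step 0 -> AFP = n.
assert average_finalization_parallelism([0] * 8) == 8.0

# Mixed trace: 8 tokens finalized over 4 distinct steps -> AFP = 2.0.
print(average_finalization_parallelism([0, 0, 1, 1, 2, 2, 3, 3]))  # 2.0
```

For example, at $n = 512$ and $\mathrm{AFP} = 2.2$ (typical of the results in Section 4), generation takes roughly $512/2.2 \approx 233$ effective denoising steps versus 512 for an AR decoder.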
2. Theoretical Properties and Bounds
AFP is bounded as

$$1 \le \mathrm{AFP} \le n,$$

where the lower bound reflects pure sequentiality (as in AR decoding), and the upper bound corresponds to all tokens finalized in a single pass. In block-diffusion schemes with block size $B$, the theoretical maximum is $\mathrm{AFP} = B$, since no more than $B$ tokens can be finalized per step due to the block masking mechanism.
Importantly, AFP is independent of hardware and reflects the model's conditional-independence factorization structure. It is agnostic to runtime system details and directly measures statistical, rather than computational, parallelism. The "no-slowdown" analysis (Theorem 4.4 in the source) shows that high AFP can be advantageous: paired with fast-mixing iterative editing, overall wall-clock generation time can be substantially reduced, even recovering some lost dependencies through resampling. Since the number of effective denoising steps is $T = n/\mathrm{AFP}$, the practical runtime is bounded by a term proportional to $n/\mathrm{AFP}$ plus the cost of the editing passes, with higher AFP shrinking the dominant decoding term (Zhong et al., 22 Jan 2026).
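The block bound follows directly from the definition: each distinct step finalizes at most $B$ tokens, so $n \le TB$ and hence $\mathrm{AFP} = n/T \le B$. The following toy simulation (a hypothetical randomized schedule, not the paper's actual sampler) makes this cap concrete:

```python
import random

def simulate_block_diffusion(n: int, B: int, seed: int = 0) -> list[int]:
    """Toy schedule: decode blocks left to right; at each denoising step,
    finalize a random subset (1..B tokens) of the current block.
    Returns the finalization step index for every position."""
    rng = random.Random(seed)
    steps = [-1] * n
    step = 0
    for start in range(0, n, B):
        remaining = list(range(start, min(start + B, n)))
        while remaining:
            k = rng.randint(1, len(remaining))   # tokens finalized this step
            chosen = rng.sample(remaining, k)
            for pos in chosen:
                steps[pos] = step
            remaining = [p for p in remaining if p not in chosen]
            step += 1
    return steps

for B in (4, 8, 32):
    trace = simulate_block_diffusion(n=256, B=B)
    afp = len(trace) / len(set(trace))
    assert afp <= B                              # AFP is capped by block size
    print(f"B={B}: AFP={afp:.2f}")
```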
3. Experimental Methodology and Benchmarks
AFP was evaluated across:
- Models: Eight MDLMs up to 100B parameters, including LLaDA2-flash-100B, LLaDA2-mini-16B, LLaDA-1.5, Trado (4B/8B), DiRL (8B), SDAR (4B/8B/30B), Dream-7B, and OpenPangu-7B-Diffusion. AR baselines comprised Qwen3-series, OpenAI o3, and Gemini-2.5 Pro.
- Benchmarks: 58 tasks across six domains—Knowledge (MMLU), Mathematics (GSM8K, MATH), Reasoning (BBH, GPQA), Language Understanding (HellaSwag), Agentic tasks (BFCL, Nexus), and Coding (HumanEval, MBPP, CruxEval-O, CodeForces).
- Protocol: Models decoded in a unified prompt regime on a 512-GPU cluster, with AFP and generation-order statistics (Kendall's $\tau$) measured for every output and aggregated per domain and model (Zhong et al., 22 Jan 2026).
4. Quantitative Results and Comparison with Autoregressive Models
Autoregressive models, by definition, operate fully sequentially, yielding $\mathrm{AFP} = 1$. MDLMs, which enable blockwise or mask-based parallelism, achieve $\mathrm{AFP}$ in the 1.5–4.0 range depending on block size and task. Key results include:
| Model/Block Size | AFP (approximate) | Observed Accuracy |
|---|---|---|
| LLaDA2-flash-100B, B=32 | 2.2 | Near-AR |
| LLaDA2-flash-100B, B=64 | 1.9 | Slight drop |
| LLaDA2-flash-100B, B=128 | 1.1 | Larger drop |
| OpenPangu-7B, max-1 | 1.0 | With block-4 logits |
Smaller $B$ yields higher AFP and preserves accuracy closer to AR, while larger $B$ degrades both. The extremal case (OpenPangu-7B, "max-1" strategy) demonstrates a rigid trade-off: strict sequentialization ($\mathrm{AFP} = 1$) even though the model still computes block-4 logits (Zhong et al., 22 Jan 2026).
5. Variation Across Domain and Correctness
The domain-dependent statistics reveal significant AFP heterogeneity, aligning with the underlying structure and sequential dependencies of different tasks:
| Domain Type | AFP (correct, non-repetitive) | Structural Interpretation |
|---|---|---|
| Coding & Agent | 2.4–2.7 | Parallelizable spans (indentation, delimiters, boilerplate) |
| Mathematics & Reasoning | 2.1–2.3 | Requires sequential logical steps |
| Knowledge & NLU | 1.8–2.0 | Monolithic factual unfolding |
Correct samples consistently exhibit ∼10–20% higher AFP than incorrect ones, suggesting that confident predictions foster bolder parallelization. Repetitive or degenerate failure modes can produce pathological AFP spikes (5–7× typical values), typically accompanied by low generation-order coherence (Kendall's $\tau$ near zero), indicating loss of meaningful dependency modeling (Zhong et al., 22 Jan 2026).
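Order coherence can be computed from the same decoding trace as AFP. A sketch, assuming (as the pairing with AFP here suggests) that Kendall's $\tau$ is taken between each token's position and its finalization step:

```python
from scipy.stats import kendalltau

def order_coherence(finalization_steps: list[int]) -> float:
    """Kendall's tau between token position and finalization step.
    tau = 1 means strictly left-to-right (AR-like) generation order;
    tau near 0 means generation order is unrelated to position."""
    positions = list(range(len(finalization_steps)))
    tau, _ = kendalltau(positions, finalization_steps)
    return tau

print(order_coherence([0, 1, 2, 3, 4, 5, 6, 7]))   # 1.0 (AR-like order)
print(order_coherence([4, 0, 6, 2, 7, 1, 5, 3]))   # ~0.07 (scrambled order)
```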
6. Interpretations, Strengths, and Limitations
AFP exposes a fundamental trade-off: while MDLMs can adaptively parallelize structurally independent spans (leading to high AFP for code, templates, or "easy" Sudoku blanks), they remain susceptible to degraded conditional modeling at higher block sizes due to the factorization gap (as formalized in Lemma 3.1). This trade-off is reflected in:
- Strengths: Adaptive parallelism in tasks with block-wise or structural independence, non-monotonic solution orders (Kendall's $\tau < 1$), and efficient filling of clearly separable regions.
- Limitations: Systematic accuracy degradation at high AFP, especially for semantic or interdependent targets, attributable to loss of inter-token dependency (the conditional total correlation bound).
A plausible implication is that optimizing for AFP alone is insufficient; model designers must balance parallelism (efficiency) against conditional-dependence (accuracy).
A promising future direction is the generate-then-edit paradigm: an initial high-AFP draft phase followed by iterative editing that selectively restores missed dependencies. Theoretical analysis substantiates that such two-stage editing can achieve both competitive wall-clock performance and higher accuracy relative to purely parallel or purely sequential baselines (Zhong et al., 22 Jan 2026).
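A schematic of that two-stage loop, with `draft_step`, `find_suspect`, and `resample` as hypothetical callables standing in for a concrete MDLM sampler (the paper's actual procedure may differ):

```python
from typing import Callable

def generate_then_edit(
    n: int,
    draft_step: Callable[[list], list],     # fills many masks per step (high AFP)
    find_suspect: Callable[[list], list],   # flags positions with weak dependencies
    resample: Callable[[list, list], list], # re-generates the flagged positions
    max_edit_rounds: int = 4,
) -> list:
    MASK = None
    seq = [MASK] * n
    # Phase 1: high-AFP drafting -- finalize many tokens per denoising step.
    while MASK in seq:
        seq = draft_step(seq)
    # Phase 2: iterative editing -- selectively restore missed dependencies.
    for _ in range(max_edit_rounds):
        suspects = find_suspect(seq)
        if not suspects:
            break
        seq = resample(seq, suspects)
    return seq
```

The design point is that the draft phase buys wall-clock speed through high AFP, while the bounded edit phase pays a small sequential cost to repair the conditional dependencies the draft ignored.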
7. Significance and Research Impact
AFP provides a precise, interpretable measure of realized sequence-level parallelism that is decoupled from hardware implementation and runtime artifacts. Its use elucidates core challenges and capabilities of MDLMs versus AR LMs, particularly in context-sensitive generation tasks spanning knowledge, reasoning, and code. By quantifying fine-grained parallelization patterns jointly with order coherence (e.g., via Kendall's $\tau$), AFP enables principled comparisons of decoding architectures and facilitates systematic exploration of trade-offs in high-throughput generation. The metric has become central to evaluating and improving the practical utility of MDLMs, guiding design and deployment strategies for large-scale generative LLMs (Zhong et al., 22 Jan 2026).