
Average Finalization Parallelism (AFP)

Updated 29 January 2026
  • AFP is a quantitative metric that measures the average number of tokens finalized per denoising step in Masked Diffusion Language Models, ranging from fully sequential to fully parallel decoding.
  • It provides a hardware-agnostic evaluation of generation efficiency by capturing the trade-off between parallel processing and conditional dependency accuracy.
  • AFP is theoretically bounded between 1 (fully sequential) and n (fully parallel); empirically measured values for current MDLMs fall well below n (typically 1.5–4.0), guiding model design toward balancing speed and output coherence.

Average Finalization Parallelism (AFP) is a quantitative metric that measures the degree of token-level parallelism realized by Masked Diffusion LLMs (MDLMs) during sequence generation. Introduced in the context of evaluating the practical efficiency and conditional-independence properties of MDLMs, AFP captures, for a given sequence length $n$, the average number of tokens finalized (i.e., irreversibly unmasked) per effective denoising step during parallel decoding. Its values range from 1 (fully sequential; equivalent to autoregressive decoding) to $n$ (fully parallel; all tokens finalized simultaneously), providing a hardware-agnostic assessment of intrinsic generation speedup and reflecting the model's capacity for non-sequential, any-order decoding (Zhong et al., 22 Jan 2026).

1. Formal Definition and Computation

Let $\{1, \ldots, n\}$ be the positions in the generated sequence, and let $c_i \in \mathbb{N}$ be the index of the denoising step at which token $i$ is finalized. The set $S = \{c_1, \ldots, c_n\}$ collects all distinct finalization steps, and the number of such steps is denoted $T_{\mathrm{eff}} = |S|$. The Average Finalization Parallelism is then defined as

$$\mathrm{AFP} := \frac{n}{T_{\mathrm{eff}}}.$$

By construction:

  • Autoregressive (AR) decoders yield $\mathrm{AFP} \approx 1$.
  • Fully parallel models yield $\mathrm{AFP} = n$.
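
As a concrete toy illustration (the finalization trace here is hypothetical, chosen only to exercise the definition): suppose $n = 8$ tokens are finalized at the steps

```latex
(c_1, \ldots, c_8) = (1, 1, 2, 2, 2, 3, 4, 4), \qquad
S = \{1, 2, 3, 4\}, \qquad
T_{\mathrm{eff}} = 4, \qquad
\mathrm{AFP} = \frac{8}{4} = 2.
```

Two tokens are finalized per denoising step on average, i.e., a 2× intrinsic speedup over autoregressive decoding.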

Step-wise computation:

  1. Run MDLM diffusion decoding, recording for each token $i$ the finalization step $c_i$.
  2. Collect $S = \{c_1, \ldots, c_n\}$.
  3. Compute $T_{\mathrm{eff}} = |S|$ and $\mathrm{AFP} = n / T_{\mathrm{eff}}$.
  4. Aggregate this statistic across instances to obtain means and distributions.

AFP thus operationalizes “how many tokens, on average, the model finalizes in each denoising step,” serving as a direct measure of realized parallelism within the generative process (Zhong et al., 22 Jan 2026).
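
The four-step procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' reference implementation, and the example traces are hypothetical:

```python
def afp(finalization_steps):
    """Average Finalization Parallelism for one generated sequence.

    finalization_steps[i] is the index c_i of the denoising step at
    which token i was irreversibly unmasked (finalized).
    """
    n = len(finalization_steps)
    t_eff = len(set(finalization_steps))  # T_eff = |S|, number of distinct steps
    return n / t_eff

# Hypothetical trace: 8 tokens finalized over 4 distinct steps.
print(afp([1, 1, 2, 2, 2, 3, 4, 4]))  # 2.0
print(afp([1, 2, 3, 4]))              # 1.0 (fully sequential, AR-like)
print(afp([7, 7, 7, 7]))              # 4.0 (fully parallel, AFP = n)
```

In practice the per-instance values would then be aggregated (step 4) into means and distributions across a benchmark.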

2. Theoretical Properties and Bounds

AFP is bounded as

$$1 \leq \mathrm{AFP} \leq n,$$

where the lower bound reflects pure sequentiality (as in AR decoding), and the upper bound corresponds to all tokens finalized in a single pass. In block-diffusion schemes with block size $B$, the theoretical maximum is $B$, since no more than $B$ tokens can be finalized per step due to the block masking mechanism.
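
The block-size bound can be checked with a small idealized calculation. This is a sketch under the assumption that blocks are decoded one after another, each in a fixed number of distinct steps; `block_diffusion_afp` is an illustrative helper, not a function from the source:

```python
import math

def block_diffusion_afp(n, block_size, steps_per_block):
    """Idealized AFP under sequential block diffusion.

    Each block of `block_size` tokens uses `steps_per_block` distinct
    finalization steps before the next block begins, so
    T_eff = ceil(n / block_size) * steps_per_block.
    """
    t_eff = math.ceil(n / block_size) * steps_per_block
    return n / t_eff

# n = 128 tokens, block size B = 32:
print(block_diffusion_afp(128, 32, 1))   # 32.0 -> theoretical maximum, AFP = B
print(block_diffusion_afp(128, 32, 16))  # 2.0  -> near empirically observed values
print(block_diffusion_afp(128, 32, 32))  # 1.0  -> fully sequential within blocks
```

Since at most $B$ tokens can share a finalization step, AFP can never exceed $B$; the maximum is attained only when every block is finalized in a single step.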

Importantly, AFP is independent of hardware and reflects the model's conditional-independence factorization structure. It is agnostic to runtime system details and directly measures statistical, rather than computational, parallelism. The "no-slowdown" analysis (Theorem 4.4 in the source) shows that high AFP can be advantageous: paired with fast-mixing iterative editing, overall wall-clock generation time can be substantially reduced, even recovering some lost dependencies through resampling. The practical runtime thus satisfies a bound of $T_{\mathrm{step}}(m) \cdot O(\log(1/\delta) / (1 - \alpha(m)))$, with higher AFP reducing $T_{\mathrm{step}}(m)$ (Zhong et al., 22 Jan 2026).

3. Experimental Methodology and Benchmarks

AFP was evaluated across:

  • Models: Eight MDLMs up to 100B parameters, including LLaDA2-flash-100B, LLaDA2-mini-16B, LLaDA-1.5, Trado (4B/8B), DiRL (8B), SDAR (4B/8B/30B), Dream-7B, and OpenPangu-7B-Diffusion. AR baselines comprised Qwen3-series, OpenAI o3, and Gemini-2.5 Pro.
  • Benchmarks: 58 tasks across six domains—Knowledge (MMLU), Mathematics (GSM8K, MATH), Reasoning (BBH, GPQA), Language Understanding (HellaSwag), Agentic tasks (BFCL, Nexus), and Coding (HumanEval, MBPP, CruxEval-O, CodeForces).
  • Protocol: Models decoded in a unified prompt regime on a 512-GPU cluster, with AFP and generation-order statistics (Kendall's $\tau$) measured for every output and aggregated per domain and model (Zhong et al., 22 Jan 2026).

4. Quantitative Results and Comparison with Autoregressive Models

Autoregressive models, by definition, operate fully sequentially, yielding $\mathrm{AFP} \approx 1.0$. MDLMs, which enable blockwise or mask-based parallelism, achieve $\mathrm{AFP}$ in the 1.5–4.0 range depending on block size and task. Key results include:

| Model / Block Size | AFP (approximate) | Observed Accuracy |
|---|---|---|
| LLaDA2-flash-100B, B=32 | 2.2 | Near-AR |
| LLaDA2-flash-100B, B=64 | 1.9 | Slight drop |
| LLaDA2-flash-100B, B=128 | 1.1 | Larger drop |
| OpenPangu-7B, max-1 | 1.0 | With block-4 logits |

Smaller $B$ yields higher AFP and preserves accuracy closer to AR, while larger $B$ degrades both. The extremal case (OpenPangu-7B, "max-1" strategy) demonstrates a rigid trade-off: strict sequentialization ($\mathrm{AFP} = 1$) but with some parallel architectural details (Zhong et al., 22 Jan 2026).

5. Variation Across Domain and Correctness

The domain-dependent statistics reveal significant AFP heterogeneity, aligning with the underlying structure and sequential dependencies of different tasks:

| Domain Type | AFP (correct, non-repetitive) | Structural Interpretation |
|---|---|---|
| Coding & Agent | 2.4–2.7 | Parallelizable spans (indentation, delimiters, boilerplate) |
| Mathematics & Reasoning | 2.1–2.3 | Requires sequential logical steps |
| Knowledge & NLU | 1.8–2.0 | Monolithic factual unfolding |

Correct samples consistently exhibit ~10–20% higher AFP than incorrect ones, suggesting that confident predictions foster bolder parallelization. Repetitive or degenerate failure modes can produce pathological AFP spikes (5–7× typical values), typically accompanied by low generation-order coherence ($\tau \approx 0$), indicating a loss of meaningful dependency modeling (Zhong et al., 22 Jan 2026).

6. Interpretations, Strengths, and Limitations

AFP exposes a fundamental trade-off: while MDLMs can adaptively parallelize structurally independent spans (leading to high AFP for code, templates, or "easy" Sudoku blanks), they remain susceptible to degraded conditional modeling at higher block sizes due to the factorization gap (as formalized in Lemma 3.1). This trade-off is reflected in:

  • Strengths: Adaptive parallelism in tasks with block-wise or structural independence, non-monotonic solution orders (Kendall's $\tau < 1$), and efficient filling of clearly separable regions.
  • Limitations: Systematic accuracy degradation at high AFP, especially for semantic or interdependent targets, attributable to loss of inter-token dependency (the conditional total correlation bound).

A plausible implication is that optimizing for AFP alone is insufficient; model designers must balance parallelism (efficiency) against conditional-dependence (accuracy).

A promising future direction is the generate-then-edit paradigm: an initial high-AFP draft phase followed by iterative editing that selectively restores missed dependencies. Theoretical analysis substantiates that such two-stage editing can achieve both competitive wall-clock performance and higher accuracy relative to purely parallel or purely sequential baselines (Zhong et al., 22 Jan 2026).

7. Significance and Research Impact

AFP provides a precise, interpretable measure of realized sequence-level parallelism that is decoupled from hardware implementation and runtime artifacts. Its use elucidates core challenges and capabilities of MDLMs versus AR LMs, particularly in context-sensitive generation tasks spanning knowledge, reasoning, and code. By quantifying fine-grained parallelization patterns jointly with order coherence (e.g., via Kendall’s τ\tau), AFP enables principled comparisons of decoding architectures and facilitates systematic exploration of trade-offs in high-throughput generation. The metric has become central to evaluating and improving the practical utility of MDLMs, guiding design and deployment strategies for large-scale generative LLMs (Zhong et al., 22 Jan 2026).
