Break the Block: Dynamic-size Reasoning Blocks for Diffusion Large Language Models via Monotonic Entropy Descent with Reinforcement Learning
Abstract: Recent diffusion LLMs (dLLMs) have demonstrated both effectiveness and efficiency in reasoning via a block-based semi-autoregressive generation paradigm. Despite this progress, fixed-size block generation remains a critical bottleneck for effective and coherent reasoning. (1) From a global perspective, different reasoning tasks correspond to different optimal decoding block sizes, which makes a “one-size-fits-all” assumption ineffective. (2) Even within a single reasoning task, rigid block partitioning breaks the logical flow and reduces reasoning coherence. Through empirical observations, we reveal that, for block-wise entropy, incorrect reasoning exhibits a fluctuating and unsteady trend between blocks, whereas correct reasoning follows a consistent descending trend. Therefore, this paper proposes b1, a novel post-training framework for dLLMs that learns dynamic-size reasoning blocks via a Monotonic Entropy Descent objective with reinforcement learning to enhance reasoning coherence. b1 integrates seamlessly as a plug-and-play module with existing dLLM post-training algorithms. Extensive experiments across various reasoning benchmarks showcase b1's consistent improvement over existing fixed-size block baselines. Our code has been released at https://github.com/YanJiangJerry/Block-R1.
Explain it Like I'm 14
Overview
This paper is about helping a new kind of AI writing model think more clearly when solving problems. These models, called diffusion LLMs (dLLMs), don’t write one token at a time. Instead, they write in chunks (called “blocks”). The paper shows that using blocks of a fixed size can break a model’s train of thought, and proposes a smarter way: let the model choose block sizes dynamically so each block fits a full reasoning step. They teach the model to do this using a reward system that encourages clearer, more confident thinking as it goes.
What is the main problem?
Many dLLMs generate text in equal-sized blocks. That’s fast, but not always smart:
- Different tasks need different sized steps. A math calculation may need a longer step; a simple conclusion might need a shorter one.
- Fixed cut points can split a thought in the middle (like cutting “71 − 66” between blocks), causing confusion and mistakes.
The authors noticed a pattern: when the model is reasoning correctly, its uncertainty tends to go down block by block. When it’s wrong, uncertainty jumps around.
What did the researchers want to find out?
In simple terms, they asked:
- Can we let the model decide where a reasoning step ends, instead of forcing fixed-size blocks?
- Can we encourage the model to become more confident as it goes, step by step?
- Will this make the model solve math and logic problems more accurately without slowing it down much?
How did they do it? (Methods in simple language)
Think of writing an answer like building with LEGO:
- Fixed-size blocks = all pieces are one size. That’s fast, but awkward if you need a longer piece for a certain section.
- Dynamic blocks = pick the piece length that fits each part of the build.
Here’s how they made dynamic blocks work:
1) A special “end-of-step” indicator
They added a special token (like a hidden “end of step” marker) the model can place when it finishes a reasoning step. The model generates tokens in parallel inside each block until it emits this marker, then it starts the next block. This way, each block naturally contains one complete thought.
To avoid the model making super-short or meaningless steps, they give it a small reward when it produces a reasonable number of steps (not too few, not too many).
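As a rough illustration of how an end-of-step marker yields variable-size blocks, the Python sketch below splits an already generated token list at each marker. The marker string `<step_end>` and the function name are hypothetical, since the actual token is not named here. The sketch only shows the resulting block boundaries and does not model the parallel denoising that happens inside a block.

```python
from typing import List

# Hypothetical marker string; the paper's actual indicator token is not specified here.
END_OF_STEP = "<step_end>"


def split_into_blocks(tokens: List[str], marker: str = END_OF_STEP) -> List[List[str]]:
    """Partition a token sequence into blocks, closing a block at each marker."""
    blocks: List[List[str]] = []
    current: List[str] = []
    for tok in tokens:
        if tok == marker:
            if current:  # skip empty steps produced by back-to-back markers
                blocks.append(current)
            current = []
        else:
            current.append(tok)
    if current:  # trailing tokens without a closing marker form a final block
        blocks.append(current)
    return blocks


tokens = ["71", "-", "66", "=", "5", END_OF_STEP,
          "5", "*", "2", "=", "10", END_OF_STEP,
          "answer", "10"]
blocks = split_into_blocks(tokens)
block_sizes = [len(b) for b in blocks]  # block lengths differ from step to step
```

Here the expression "71 - 66" stays inside one block because the boundary is placed where the step ends, not at a fixed token count.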
2) Teaching the model to get more confident as it goes
They measure the model’s uncertainty using “entropy,” which you can think of as a confusion meter:
- High entropy = the model isn’t sure (lots of options look equally likely).
- Low entropy = the model is confident.
The key idea: good reasoning should make the model’s confusion go down over time. So they reward the model when each block is less uncertain than the previous one (a “monotonic entropy descent”).
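To make the "confusion meter" concrete, here is a minimal Python sketch. It assumes natural-log Shannon entropy per token, block entropy as the mean token-wise entropy within a block, and a local score equal to the fraction of adjacent block pairs where entropy strictly drops. The function names and the normalisation by the number of pairs are illustrative assumptions, not the paper's exact reward.

```python
import math
from typing import Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one token's predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def block_entropy(block: Sequence[Sequence[float]]) -> float:
    """Mean token-wise entropy over all token distributions in a block."""
    if not block:
        raise ValueError("block must contain at least one token distribution")
    return sum(token_entropy(p) for p in block) / len(block)


def descent_fraction(entropies: Sequence[float]) -> float:
    """Fraction of adjacent block pairs whose entropy strictly decreases (assumed normalisation)."""
    if len(entropies) < 2:
        return 0.0
    drops = sum(1 for prev, cur in zip(entropies, entropies[1:]) if cur < prev)
    return drops / (len(entropies) - 1)


# Toy distributions over a 4-word vocabulary (illustrative values only).
unsure = [0.25, 0.25, 0.25, 0.25]     # high entropy: every option looks equally likely
confident = [0.97, 0.01, 0.01, 0.01]  # low entropy: one option dominates

first_block = block_entropy([unsure, unsure])
second_block = block_entropy([confident, confident])
score = descent_fraction([first_block, second_block])
```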
They also relate this to a ranking statistic (Spearman’s rank correlation) that measures whether entropy keeps going down overall. Instead of optimizing the global statistic directly (which is tricky), they use a simple, local rule: reward each step where entropy drops from one block to the next. They show that pushing these local drops leads to the same overall goal of steadily decreasing uncertainty.
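The global statistic can be sketched in a few lines of Python. The code below computes the negated Spearman rank correlation between block position and block entropy, so a steadily decreasing entropy sequence scores +1 and a steadily increasing one scores -1. Average ranks for ties and the function names are assumptions of this sketch; the paper's handling of ties is not described here.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def negative_spearman(entropies: Sequence[float]) -> float:
    """Negated Spearman correlation between block index and block entropy."""
    if len(entropies) < 2:
        return 0.0
    positions = [float(i + 1) for i in range(len(entropies))]
    return -pearson(positions, average_ranks(entropies))


steady = negative_spearman([1.5, 1.1, 0.7, 0.2])  # entropy falls at every block
bumpy = negative_spearman([1.5, 0.7, 1.1, 0.2])   # one upward jump in the middle
rising = negative_spearman([0.2, 0.7, 1.1, 1.5])  # entropy rises at every block
```

In this toy comparison the sequence with a single upward jump scores lower than the steadily falling one, which is the behaviour the local drop-counting rule is meant to encourage.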
3) Learning with reinforcement learning (RL)
They use RL, which is like training a pet with treats:
- If the model places the end-of-step marker sensibly and keeps uncertainty dropping block by block, it gets rewards.
- They plug these rewards into existing dLLM training methods (like GRPO-style algorithms). You can think of it as a “plug-and-play” add-on that works with current training recipes.
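A minimal sketch of how such rewards could feed a GRPO-style update is shown below. It assumes the total reward is a weighted sum of a task reward, an entropy-descent reward, and an indicator reward, and that each completion's advantage is its reward normalised by the mean and standard deviation of its group. The weights `w_ent` and `w_ind`, the function names, and the division by the standard deviation are assumptions of this sketch, not settings taken from the paper.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def total_reward(r_task: float, r_ent: float, r_ind: float,
                 w_ent: float = 1.0, w_ind: float = 1.0) -> float:
    """Weighted sum of task, entropy-descent, and indicator rewards (weights are hypothetical)."""
    return r_task + w_ent * r_ent + w_ind * r_ind


def group_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalise each reward against its group's mean and population standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy group of three completions for one prompt (illustrative reward values only).
group = [
    total_reward(r_task=1.0, r_ent=1.0, r_ind=0.5),  # correct, steady descent
    total_reward(r_task=1.0, r_ent=0.5, r_ind=0.5),  # correct, fluctuating entropy
    total_reward(r_task=0.0, r_ent=0.0, r_ind=0.0),  # incorrect
]
advantages = group_advantages(group)
```

Because the extra rewards only change the scalar that enters the advantage, a scheme like this can sit on top of an existing training loop without touching the policy-gradient machinery itself.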
4) Does it slow things down?
They check training time and generation speed. The dynamic approach adds only a tiny overhead and keeps speed close to the old method, because most time is still spent in the main model computations.
What did they find? (Main results)
Across several reasoning benchmarks (like GSM8K, MATH, Sudoku, and a number puzzle called Countdown), their method consistently improved accuracy compared to fixed-size blocks. Highlights include:
- Big gains on the Countdown puzzle: up to about +19 percentage points improvement in accuracy when added on top of a strong baseline.
- Steadier “confidence curves”: more examples showed the desired “entropy goes down each step” pattern.
- More help on hard problems: for cases the baseline got wrong, the new method often increased the “confidence-descending” pattern and turned those into correct answers.
They also found the two parts of their method both matter:
- Without the “confidence should go down” reward, performance dropped.
- Without the end-of-step marker, performance also dropped.
- Together, they worked best.
Why does this matter? (Implications)
- More coherent thinking: By letting the model decide where a reasoning step ends, it keeps thoughts intact instead of chopping them mid-calculation, leading to fewer mistakes.
- General and practical: This approach fits into existing training pipelines for dLLMs with minimal extra cost.
- Better for math and logic: Tasks that need careful step-by-step reasoning benefit the most.
- A path forward: It shows that not just “what” the model writes, but “how” it structures its reasoning (dynamic steps + growing confidence) is crucial for high-quality answers.
In short, the paper shows a simple but powerful idea: let the model control its step size and reward it for becoming more certain as it goes. This makes its reasoning more natural, more accurate, and more reliable.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a focused list of what remains missing, uncertain, or unexplored in the paper, phrased to guide concrete follow-up work.
Methodological assumptions and signal choice
- The paper treats “monotonic entropy descent” as causal for better reasoning but shows only correlation; it does not isolate whether enforcing descent itself drives the improvements. Design controlled interventions to test causality (e.g., enforce descent without Tend and vice versa).
- Entropy is computed under a mean-field assumption (token independence within a block) and at a single diffusion step; the impact of these approximations on the validity of entropy as an uncertainty proxy is not evaluated. Compare to alternatives (e.g., mutual information across tokens, variance across diffusion steps, ensemble disagreement) and assess calibration (ECE/Brier) of dLLM token probabilities.
- The surrogate reward counts only adjacent entropy decreases and ignores the magnitude of decreases; two sequences with tiny vs large drops are treated equally. Evaluate magnitude-aware variants (e.g., weighted sum of positive drops, area under the entropy curve, convexity penalties) and their effect on performance.
- The reliance on a special indicator token (Tend) to align “semantic steps” is not validated against ground-truth step boundaries. Collect or leverage step-annotated data to quantify boundary precision/recall and the agreement between predicted and human-annotated step endpoints.
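The magnitude point in the list above can be illustrated with a small Python sketch. It contrasts a count-based score (number of adjacent strict drops) with one hypothetical magnitude-aware variant (sum of positive drops). Both function names and the magnitude-aware form are assumptions for illustration; neither is claimed to be the paper's reward.

```python
from typing import Sequence


def count_drops(entropies: Sequence[float]) -> int:
    """Count adjacent block pairs where entropy strictly decreases."""
    return sum(1 for prev, cur in zip(entropies, entropies[1:]) if cur < prev)


def sum_positive_drops(entropies: Sequence[float]) -> float:
    """Hypothetical magnitude-aware variant: total size of all entropy decreases."""
    return sum(max(prev - cur, 0.0) for prev, cur in zip(entropies, entropies[1:]))


tiny_drops = [1.00, 0.99, 0.98]   # entropy barely moves
large_drops = [1.00, 0.50, 0.10]  # entropy falls sharply

same_count = count_drops(tiny_drops) == count_drops(large_drops)
tiny_total = sum_positive_drops(tiny_drops)
large_total = sum_positive_drops(large_drops)
```

The count-based score cannot tell the two sequences apart, while the magnitude-aware variant separates them, which is the kind of comparison a follow-up ablation could run.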
Reward design, stability, and failure modes
- Failure modes around the indicator token are not analyzed: missed indicators (no boundary), spurious early indicators (over-segmentation), or indicators that fall inside atomic expressions (misalignment). Instrument and report the frequency and impact of each failure case; design corrective rewards or constraints.
- The block-count reward (log-scaled with a fixed Ktarget) may encourage pathological segmentation (many small blocks) or under-segmentation depending on task. Perform task-wise sensitivity analysis over Ktarget and introduce adaptive/learned targets.
- Potential reward hacking is unaddressed: the model could reduce entropy by emitting boilerplate or trivial tokens before the indicator to “game” Rent. Add anti-degeneration checks (diversity penalties, semantic consistency checks) and measure token-level diversity and repetition.
- The paper sets reward weights without justification beyond convenience. Provide ablations to characterize stability/optimality across weightings and across RL algorithms (e.g., GRPO with/without KL).
- The non-differentiable indicator function and sparse rewards can yield high-variance policy gradients; there is no analysis of gradient variance or additional variance-reduction strategies beyond GRPO. Report gradient variance, learning curves across seeds, and test baselines like moving-average baselines or control variates tailored to Rent and Rind.
Theoretical guarantees and edge cases
- The equivalence between the local pairwise reward and the global Spearman objective (Theorem 1) is only stated; conditions (e.g., handling ties, noise, finite blocks) and robustness to estimation noise are not discussed. Provide full proofs with assumptions, and empirical tests on synthetic sequences with controlled monotonicity, ties, and noise.
- The claim of enforcing “strictly monotonic” entropy descent is at odds with the binary pairwise indicator that allows equality to be unpenalized; clarify whether equal entropies violate strictness and, if so, adjust the objective or the claim.
- The work does not analyze how the number of diffusion steps T, the noise schedule, or temperature settings affect entropy estimates and the reward’s behavior. Systematically vary T and sampling hyperparameters to assess stability and efficacy of MED.
Generalization and scope
- Evaluation is limited to a single 8B dLLM (LLaDA-8B-Instruct) and four English reasoning benchmarks (GSM8K, MATH500, Sudoku, Countdown) with short sequence lengths (256/512). Test across:
- Larger and smaller dLLMs, other architectures, and training pipelines.
- Long-context regimes (e.g., 1k–8k tokens) to assess scalability and latency when the sequence length grows.
- Additional domains: code generation, scientific QA, multi-hop commonsense, multilingual and non-English reasoning, and multimodal dLLMs.
- The impact of dynamic blocks on non-reasoning tasks (summarization, open-ended generation, dialogue) is not measured. Evaluate whether MED harms creativity, fluency, or diversity in such settings.
- Zero-shot pass@1 is the only setting considered; no evaluation of few-shot prompting, pass@k sampling, or verifier-augmented pipelines. Benchmark across these common setups.
Comparisons and baselines
- Baseline coverage is incomplete: no comparisons to autoregressive LLMs with chain-of-thought planning/step tagging, step-supervised diffusion methods, or other dynamic segmentation strategies (e.g., plan-then-execute, external planners). Add baselines using supervised step boundaries and AR planning methods for a fairer assessment.
- AdaBlock-dLLM is reproduced in a zero-shot setting although it was designed for few-shot; the paper does not explore combining MED with inference-time adaptive decoding methods like Deferred Commitment Decoding (DCD). Evaluate hybrid approaches and compare under their intended settings.
Inference, formatting, and usability
- The paper hides special tokens in figures but does not clarify how Tend and intermediate step markers are handled in deployed outputs. Specify post-processing rules and measure the rate of indicator leakage into final answers and its effect on graders or downstream consumers.
- It is unclear how Tend is added to the tokenizer/vocabulary, how often it is inadvertently emitted as content, and whether it conflicts with existing tokens. Report collision rates, OOV rates, and mitigation strategies (e.g., reserved control tokens, constrained decoding).
- The behavior when no indicator is generated (e.g., fallbacks to max-length blocks) is unspecified. Define and evaluate explicit fallback policies and their effect on accuracy and latency.
Efficiency and systems considerations
- Reported overheads cover only short sequences; worst-case and long-context overheads (O(K·T·L)) and memory footprints are not characterized. Profile latency/throughput vs L, K, and T, including tail latencies and GPU memory usage.
- Dynamic block boundaries may increase the number of blocks (K), affecting wall-clock latency despite similar tokens/s; per-sample latency distributions are not reported. Provide per-sample latency and block-count distributions to quantify trade-offs.
- The approach’s compatibility with optimized kernels (e.g., FlashAttention variants) and batching/padding strategies for variable-size blocks is not discussed. Explore systems-level optimizations for dynamic parallelism.
Robustness, safety, and calibration
- Driving entropy down may reduce diversity and induce overconfidence or mode collapse; effects on calibration (ECE, NLL), diversity (distinct-n), and error detectability are not measured. Add calibration and diversity evaluations and assess any trade-offs with accuracy.
- Robustness to adversarial or out-of-distribution prompts (e.g., ambiguous math problems, noisy inputs) is unexplored. Conduct stress tests and ablations to determine whether MED encourages brittle behavior.
- Interactions with safety/toxicity/risk of spurious deterministic outputs are not studied; evaluate whether monotonic entropy descent increases the likelihood of confidently wrong outputs and assess detectability/mitigation via uncertainty thresholds.
Reproducibility and statistical rigor
- The paper reports single-run accuracies without confidence intervals or significance tests and does not present seed variance. Provide multi-seed results, standard deviations, and statistical significance analyses.
- Reward dynamics are only partially shown; more transparent training diagnostics (e.g., reward component trajectories, rscc distributions over time, collapse detection) would aid reproducibility and debugging.
Extensions and integrations
- The method is claimed to be plug-and-play but is only demonstrated with a few RL algorithms (Diffu-GRPO, GDPO, d1, wd1) and with wd1’s KL term removed. Test integration with broader families (e.g., DPO-style objectives, preference ranking with verifiers) and with/without KL regularization.
- Combining MED with verifier-guided or self-reflection pipelines (e.g., step-level verification, retry/repair) is not explored. Evaluate whether dynamic blocks yield better verifier hooks and improved verification efficiency.
- The approach assumes a single indicator type; it is unknown whether multiple, typed indicators (e.g., step-begin, calculation-end, summary) improve alignment and reasoning. Prototype multi-indicator schemes and measure segmentation quality/accuracy gains.
Practical Applications
Immediate Applications
Below are actionable use cases that can be deployed now by leveraging the paper’s plug-and-play b1 framework, its Monotonic Entropy Descent (MED) reward, and the released codebase.
- Application: Reasoning performance upgrades for existing diffusion LLM deployments
- Sectors: software, education (edtech), research tooling
- What to do: Integrate b1 as a post-training RL module (e.g., add Rent and Rind to Diffu-GRPO/GDPO/wd1 pipelines) to switch from fixed-size to dynamic reasoning blocks in LLaDA-style dLLMs. Expect measurable gains on math/puzzle-style tasks (e.g., up to ~19.5% on Countdown in the paper).
- Tools/workflows: Fine-tuning pipeline update; CI for model releases includes rscc/MED diagnostics; automatic checkpoint selection by accuracy and rscc.
- Assumptions/dependencies: Requires an existing dLLM and RL post-training stack; rewardable tasks (correctness, format) for Rtask; compute budget for RL; modest changes to inference stack to honor the new indicator token.
- Application: Reasoning QA and monitoring via block-entropy analytics
- Sectors: MLOps, enterprise AI governance, academia
- What to do: Log block-wise entropy and compute rscc per generation to detect incoherent reasoning early (negative or low rscc) and flag outputs for fallback, re-sampling, or human review.
- Tools/workflows: “Reasoning QA dashboard” that plots block entropy curves, rscc distributions, and error correlations; gating rules that trigger fallback to AR models when rscc < threshold.
- Assumptions/dependencies: Access to token probabilities at inference; compatibility with privacy policies if chain-of-thought is hidden or suppressed in user-facing responses.
- Application: Hard-sample mining and active learning
- Sectors: model training/ops, academia
- What to do: Use low-rscc generations to select hard examples for further fine-tuning or data curation; prioritize samples with fluctuating or ascending block entropy for annotation or targeted training.
- Tools/workflows: Data curation pipelines that tag and route low-rscc samples; reward-shaping experiments on these subsets.
- Assumptions/dependencies: Availability of correctness signals or verifiers; storage and governance for telemetry.
- Application: Edtech math tutors with cleaner, step-aligned explanations
- Sectors: education
- What to do: Deploy b1-enhanced dLLMs to produce coherent, step-delimited solutions using the indicator token to align each block with a reasoning step; optionally display steps to learners or keep them internal for validation.
- Tools/products: Math solvers, step-by-step practice assistants, grading aids that use rscc to judge solution stability.
- Assumptions/dependencies: Domain fit (math-style tasks similar to GSM8K/MATH/Countdown); content policy for exposing or hiding chain-of-thought.
- Application: Internal decision-support in finance/ops with confidence-aware gating
- Sectors: finance, operations, business analytics
- What to do: For spreadsheet-like calculations, reconciliations, and scenario breakdowns, use b1 to improve multi-step reasoning; gate outputs via rscc to trigger audit workflows for high-stakes cases.
- Tools/workflows: “Reasoning guardrails” that combine correctness checks with MED; report generation that annotates reasoning stability.
- Assumptions/dependencies: Availability of structured correctness checks; regulatory constraints on automated decision support.
- Application: Tool-call orchestration between reasoning steps
- Sectors: software agents, enterprise automation
- What to do: Treat the dynamic end-of-step indicator as a reliable hook to schedule external tool calls (e.g., calculators, search, DB queries) exactly at semantic step boundaries.
- Tools/workflows: Agent frameworks that subscribe to Tend events to invoke tools; budgets that cap K via Ktarget to control latency/cost.
- Assumptions/dependencies: Tool APIs; adherence to latency budgets; careful reward design so indicators reflect true step boundaries.
- Application: Inference-time fallback policies for reliability
- Sectors: platform engineering, customer-facing AI products
- What to do: If rscc is low or the local entropy drops (Rent) fail in early blocks, automatically switch to a safer but slower pathway (e.g., smaller blocks, more diffusion steps, or an AR model).
- Tools/workflows: Multi-policy routers; A/B tests comparing productivity vs. reliability.
- Assumptions/dependencies: Multiple deployed policies available; consistent entropy instrumentation.
- Application: Research diagnostics and benchmarking
- Sectors: academia, AI labs
- What to do: Use MED/rscc as a protocol alongside accuracy to compare reasoning interventions, ablations, and hyperparameters; report rscc bins vs. accuracy to characterize robustness.
- Tools/workflows: Benchmark harnesses that calculate rscc, TMED, and correlate with task outcomes.
- Assumptions/dependencies: Access to logits/probabilities and block segmentation.
- Application: Privacy-preserving internal CoT with external concise answers
- Sectors: enterprise AI, customer support
- What to do: Keep dynamic blocks and step indicators internal for validation and quality control, while returning short-form final answers externally to avoid CoT disclosure risks.
- Tools/workflows: Serve-two-views approach (internal CoT + external concise response) with rscc as an internal confidence measure.
- Assumptions/dependencies: Policy alignment on CoT usage; logging and access controls.
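The monitoring and fallback applications above both reduce to simple routing rules over logged block entropies. The Python sketch below is one hypothetical way to express them: an early check that flags any non-decreasing transition among the first few blocks, followed by a threshold on rscc for the finished generation. The threshold of 0.5, the three-block window, and all function names are illustrative assumptions that would need calibration per domain.

```python
from typing import Sequence


def early_violation(block_entropies: Sequence[float], early_blocks: int = 3) -> bool:
    """True if entropy fails to strictly drop between any two of the first `early_blocks` blocks."""
    early = block_entropies[:early_blocks]
    return any(cur >= prev for prev, cur in zip(early, early[1:]))


def gate_by_rscc(rscc: float, threshold: float = 0.5) -> str:
    """Accept a finished generation only if its rscc reaches the (hypothetical) threshold."""
    return "accept" if rscc >= threshold else "fallback"


def route(block_entropies: Sequence[float], rscc: float,
          threshold: float = 0.5, early_blocks: int = 3) -> str:
    """Fall back on an early monotonicity violation, otherwise gate on rscc."""
    if early_violation(block_entropies, early_blocks):
        return "fallback"
    return gate_by_rscc(rscc, threshold)


# Toy telemetry (illustrative values only).
decision_ok = route([1.2, 0.9, 0.6, 0.8], rscc=0.8)  # late bump is outside the early window
decision_early = route([1.2, 1.3, 0.6], rscc=0.8)    # entropy rises in the second block
decision_low = route([1.2, 0.9, 0.6], rscc=0.1)      # clean start but weak overall trend
```

What "fallback" triggers (re-sampling, smaller blocks, an AR model, or human review) is a deployment choice left outside this sketch.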
Long-Term Applications
Below are forward-looking opportunities that require further research, scaling, domain adaptation, or integration beyond current math/puzzle benchmarks.
- Application: Clinical reasoning and guideline-conformant decision support
- Sectors: healthcare
- Vision: Use dynamic step boundaries to align with clinical decision pathways; enforce MED and step validations to reduce diagnostic incoherence.
- Tools/products: Clinical copilot with “reasoning blocks” mapped to guideline steps; entropy-based alerts for uncertainty.
- Dependencies/assumptions: High-quality medical reward signals and verifiers; rigorous validation; regulatory approval; risk management for CoT exposure.
- Application: Robotics/task planning with step-anchored execution
- Sectors: robotics, manufacturing, logistics
- Vision: Map dynamic reasoning blocks to executable subplans; trigger perception/control modules at Tend boundaries; reject plans with non-monotonic entropy.
- Tools/workflows: Plan-and-act stacks where Tend triggers action execution and sensors provide feedback for RL.
- Dependencies/assumptions: Grounded simulators and real-world feedback; safety constraints; multi-modal dLLMs for perception-language-action.
- Application: Multimodal dynamic reasoning (vision, speech, code)
- Sectors: multimodal AI, accessibility, software engineering
- Vision: Extend MED to multi-channel reasoning (e.g., visual grounding, speech understanding, code refactoring) where step boundaries sync across modalities.
- Tools/products: Multimodal assistants that chunk complex tasks into stable steps; IDE copilots with dynamic planning and low-entropy proof steps.
- Dependencies/assumptions: Generalization of entropy measures to non-text tokens; multi-modal RL rewards; UI integration.
- Application: Standardized “reasoning stability” metrics for policy and compliance
- Sectors: policy, AI assurance, compliance
- Vision: Adopt rscc/MED as audit metrics in governance frameworks to assess the reliability of multi-step automated reasoning, particularly for high-stakes use.
- Tools/workflows: Certification checklists; conformance tests that require positive rscc distributions under domain stress tests.
- Dependencies/assumptions: Consensus on metric definitions; stress-test suites; updated regulation acknowledging internal signals.
- Application: Model-agnostic dynamic step segmentation (beyond dLLMs)
- Sectors: foundation models broadly
- Vision: Adapt the MED principle to AR or masked models (e.g., “virtual blocks” in AR via sliding windows) to achieve similar coherence gains.
- Tools/workflows: Middleware that estimates per-span entropy trends and inserts virtual Tend markers; RL or RAFT variants to reinforce monotonicity.
- Dependencies/assumptions: Reliable entropy proxies for AR models; training-time/inference-time hooks without prohibitive latency.
- Application: Reasoning compilers and graph-of-thought execution engines
- Sectors: software agents, data engineering
- Vision: Treat dynamic blocks as nodes in a reasoning DAG; compile steps into verifiable subroutines with local correctness checks before merging.
- Tools/products: “Reasoning compiler” that translates block sequences into executable pipelines; cache-and-reuse of validated sub-results.
- Dependencies/assumptions: Verifiers and subtask oracles; robust step-local rewards; interoperability with tool ecosystems.
- Application: Cost/latency-optimized reasoning via adaptive step budgeting
- Sectors: cloud AI platforms, energy-efficiency initiatives
- Vision: Dynamically modulate K (via Rind and Ktarget) and block sizes to meet SLA or energy budgets while preserving MED; auto-tune diffusion steps for low-entropy convergence.
- Tools/workflows: Controllers that trade off accuracy vs. cost using rscc forecasts; green AI reporting of entropy vs. energy curves.
- Dependencies/assumptions: Reliable cost–quality models; hardware/runtime support for adaptive scheduling; domain-specific SLAs.
- Application: Safety filters that prevent escalation of faulty chains
- Sectors: safety-critical AI, legal and compliance tech
- Vision: Abort or re-route reasoning when early blocks fail local entropy drops (Rent) or rscc trends upward; combine with adversarial probes.
- Tools/workflows: Pre-deployment red teaming using entropy diagnostics; runtime “safe-interrupt” based on monotonicity violations.
- Dependencies/assumptions: Calibrated thresholds per domain; comprehensive coverage against adversarial cases.
- Application: Curriculum/reward shaping for complex domains
- Sectors: scientific discovery, engineering design
- Vision: Use MED as a scaffold for curricula that gradually increase reasoning complexity; co-train with domain verifiers (e.g., theorem checkers, simulators).
- Tools/workflows: Iterative RL where rscc gates progression; synthetic data generation that targets entropy stabilization.
- Dependencies/assumptions: Strong domain verifiers; scalable simulation; careful avoidance of CoT leakage when inappropriate.
- Application: Marketplace and tooling for entropy-aware model diagnostics
- Sectors: AI tooling ecosystem
- Vision: Commercial tools that integrate with popular inference servers to compute block entropy, rscc, and provide actionable remediations.
- Tools/products: Plugins for Triton/vLLM/Transformers; SDKs for logging and alerting; “EntropyScope” dashboards for teams.
- Dependencies/assumptions: Vendor cooperation for logits access; standard telemetry schemas; privacy-by-design practices.
Notes on feasibility across applications:
- Generalization risk: The strongest evidence is on math/puzzle reasoning; domain transfer (e.g., legal/medical) needs new reward functions and rigorous validation.
- Architectural dependency: The method targets diffusion LLMs with block-based generation; applying to AR models requires adaptation.
- Instrumentation: Access to token distributions is necessary to compute entropy and rscc; some production stacks do not expose logits by default.
- Chain-of-thought policy: Exposing step-by-step text may be sensitive; many applications should keep steps internal and only use indicators/metrics for QA and control.
- Compute: RL post-training introduces additional cost; however, the paper reports negligible training overhead relative to baseline dLLM training and competitive inference throughput.
Glossary
- Advantage: In policy optimization, a baseline-corrected estimate of return used to weight updates, often computed relative to a group or batch baseline. "Âg represents the advantage of the g-th sequence normalised against the group average of G completions."
- Autoregressive (AR) models: LLMs that generate text token-by-token, each conditioned on previous tokens. "have emerged as compelling alternatives to autoregressive (AR) models."
- Block-based generation: A decoding strategy that partitions a sequence into blocks and generates the tokens within each block in parallel. "Such block-based generation (Nie et al., 2025; Zhu et al., 2025a; Arriola et al., 2025) demonstrates substantial potential"
- Block ending indicator: A special token that marks the end of a reasoning step to define dynamic block boundaries. "Block Ending Indicator Reward."
- Block entropy: The average uncertainty (Shannon entropy) of token distributions within a block, used to measure confidence. "The block entropy is calculated as the mean token-wise entropy within the same block."
- Block entropy reward: A reinforcement signal encouraging entropy to decrease across blocks to promote coherent reasoning. "a surrogate block entropy reward Rent is proposed."
- d1: A diffusion-LLM reinforcement learning framework using GRPO-style optimization. "For instance, d1 (Zhao et al., 2025) and wd1 (Tang et al., 2025) formulate the diffusion-based GRPO objective as:"
- Denoising cross-entropy loss: A training objective for diffusion LLMs that predicts original tokens from masked/noisy inputs using cross-entropy. "The training objective can be formulated as a denoising cross-entropy loss"
- Diffu-GRPO: A diffusion-based variant of GRPO used as a post-training objective for dLLMs. "including Diffu-GRPO (Zhao et al., 2025), GDPO (Rojas et al., 2025) and wd1 (Tang et al., 2025)"
- Diffusion LLMs (dLLMs): LLMs that generate text by reversing a masking/noising process, enabling parallel token generation. "Diffusion LLMs (dLLMs) (Zhu et al., 2025a; Ye et al., 2025; Arriola et al., 2025)"
- Diffusion steps: The discrete iterations in the denoising process used to reconstruct tokens. "For each block bk, the dLLM performs T diffusion steps to denoise all masked tokens"
- Diffusion time: A continuous variable indexing the level of corruption in the diffusion process. "Here t ~ U[0, 1] represents continuous diffusion time"
- Dynamic-size reasoning blocks: Blocks whose lengths adapt to reasoning step boundaries rather than being fixed-size. "learns dynamic-size reasoning blocks via a Monotonic Entropy Descent objective"
- GDPO: Group Diffusion Policy Optimization, an RL method for dLLMs focusing on stable optimization. "GDPO (Rojas et al., 2025)"
- Group Relative Policy Optimisation (GRPO): An RL algorithm that uses group-relative baselines to stabilize policy optimization. "particularly Group Relative Policy Optimisation (GRPO), as a post-training technique"
- Indicator reward: An auxiliary reward encouraging the model to emit sufficient block-ending indicators for multi-step reasoning. "an indicator reward Rind is proposed."
- Kullback–Leibler (KL) divergence: A regularization term penalizing deviation from a reference policy during RL fine-tuning. "β controls the strength of the KL divergence penalty with respect to the reference policy πref."
- MASK token: A special placeholder symbol used to denote masked tokens during diffusion training. "with a special [MASK] token."
- Mean-field formulation: An independence assumption across tokens within a parallel decoding step, used to simplify entropy computation. "Following the mean-field formulation in block-based dLLMs"
- Monotonic Entropy Descent (MED): An objective encouraging block entropies to decrease strictly across the reasoning process to enhance coherence. "introduces Monotonic Entropy Descent (MED) via a block entropy reward"
- Negative Spearman's Rank Correlation Coefficient: The negated Spearman’s rho over block entropies, measuring the strength of monotonic descent. "Definition 1: Negative Spearman's Rank Correlation Coefficient for Block Entropy"
- Parallel token generation: Generating multiple tokens simultaneously rather than one-by-one. "dLLMs generate tokens in a parallel manner"
- Pass@1: A metric indicating whether the first generated completion solves the task. "Zero-shot pass@1 results are reported"
- Plug-and-play module: A component that can be integrated into existing systems without modifying their core algorithms. "b1 functions as a versatile, plug-and-play module"
- Policy ratio: The ratio of current to old policy probabilities used in clipped policy-gradient updates. "ri(θ) denotes the policy ratio"
- Reference policy: The baseline policy distribution used for KL-regularization in RL objectives. "with respect to the reference policy πref."
- Semi-autoregressive generation: A hybrid decoding scheme that generates blocks sequentially but tokens within each block in parallel. "a block-based semi-autoregressive generation paradigm."
- Self-attention: The mechanism enabling each token to attend to others; it dominates compute cost with quadratic scaling in sequence length. "dominated by self-attention that scales quadratically with the sequence length."
- Shannon entropy: The standard information-theoretic measure of uncertainty of a probability distribution. "H(.) denotes the Shannon entropy over the full vocabulary distribution"
- Spearman's Rank Correlation Coefficient: A nonparametric statistic measuring monotonic association between two ranked variables. "minimise Spearman's Rank Correlation Coefficient (Charles, 1904)"
- Token-wise entropy: Entropy computed for the predictive distribution of each token. "mean token-wise entropy"
- wd1: A weighted policy optimization method tailored for reasoning with diffusion LLMs. "wd1 (Tang et al., 2025)"
- Zero-shot: An evaluation setting where the model receives no task-specific examples at inference time. "Zero-shot pass@1 results are reported"