
Learning Generation Orders for Masked Discrete Diffusion Models via Variational Inference

Published 27 Feb 2026 in cs.LG | (2602.23968v1)

Abstract: Masked discrete diffusion models (MDMs) are a promising new approach to generative modelling, offering parallel token generation and therefore greater efficiency than their autoregressive counterparts. However, achieving an optimal balance between parallel generation and sample quality remains an open problem. Current approaches primarily address this issue through fixed, heuristic parallel sampling methods. Some recent learning-based approaches to this problem exist, but the problem's formulation from the perspective of variational inference remains underexplored. In this work, we propose a variational inference framework for learning parallel generation orders for MDMs. As part of our method, we propose a parameterisation for the approximate posterior of generation orders which facilitates parallelism and efficient sampling during training. Using this method, we conduct preliminary experiments on the GSM8K dataset, where our method performs competitively against heuristic sampling strategies in the regime of highly parallel generation. For example, our method achieves 33.1% accuracy with an average of only 4 generation steps, compared to 23.7–29.0% accuracy achieved by standard competitor methods in the same number of steps. We believe further experiments and analysis of the method will yield valuable insights into the problem of parallel generation with MDMs.

Summary

  • The paper introduces a variational inference framework designed to learn adaptive generation orders for masked discrete diffusion models.
  • It factorizes the generation process into an unmasking schedule and conditional token prediction, employing temperature scaling and Rao-Blackwellization for low-variance training.
  • Empirical results on GSM8K show a 33.1% accuracy in low-step regimes, outperforming heuristic baselines in efficiency and sample quality.


Context and Motivation

Masked discrete diffusion models (MDMs) have established themselves as a potent class of generative models for discrete sequences, as evidenced in text, code, and biological data domains. The central appeal of MDMs lies in their capacity for parallel token generation, providing inference-time acceleration compared to strictly autoregressive models while maintaining competitive sample quality (Nie et al., 2024). The challenge, however, is in striking an optimal trade-off between parallelism and sample quality. Overly aggressive parallelization introduces statistical dependencies that can degrade generation quality, while conservative approaches undermine the efficiency benefits.

Existing strategies predominantly rely on heuristic decoding schedules or metric-driven token selection (e.g., top-k probabilities, logit margins), but these methods remain limited in their flexibility and theoretical grounding. Learning-based approaches to generation order have shown improvement, but their integration into MDMs via a variational framework has not been rigorously explored. This paper seeks to fill that gap by introducing a variational inference methodology for learning token generation orders in MDMs, explicitly formulated to facilitate parallelism and low-variance training.

Technical Contributions

The paper proposes a latent variable model for MDMs that factorizes the generation process into:

  • Token position selection (unmasking schedule).
  • Conditional token value generation.
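
Schematically, one reverse step under this factorization might look as follows (a hedged sketch using the $P_{\psi}$ and $\mu_{\theta}$ notation that appears later in this article; the paper's exact equations are not reproduced here):

```latex
p_{\theta,\psi}(\mathbf{x}_t \mid \mathbf{x}_{t+1})
  = \sum_{\mathbf{r}_t}
      \underbrace{P_{\psi}(\mathbf{r}_t \mid \mathbf{x}_{t+1})}_{\text{which positions to unmask}}
      \prod_{n:\, r_t^{n}=1}
      \underbrace{\mu_{\theta}(x_t^{n} \mid \mathbf{x}_{t+1})}_{\text{which token to place}}
```

Here $\mathbf{r}_t$ is the binary vector selecting which masked positions to reveal at step $t$.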

This decomposition enables the application of variational inference for learning generation orders. The key technical mechanisms are:

  • An explicit parameterization of the approximate posterior for generation order, implemented as a neural network that outputs unmasking probabilities.
  • The derivation of the ELBO objective tailored to this architecture, exploiting conditional independence and Rao-Blackwellization for variance reduction.
  • Temperature scaling in the neural scoring function to regularize the unmasking schedule and maintain non-trivial batch-level training signals during early optimization phases.
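
To make the temperature's effect concrete, here is a minimal sketch (the function name, score values, and max-normalisation form are illustrative assumptions, not the paper's exact parameterisation):

```python
import numpy as np

def tempered_probs(scores, temperature):
    """Max-normalise positive scores, then temperature-scale.

    Low temperature sharpens the schedule (close to one-token-at-a-time);
    high temperature flattens it, so many positions are revealed per step.
    Illustrative only, not the paper's exact scoring function.
    """
    s = np.asarray(scores, dtype=float)
    return (s / s.max()) ** (1.0 / temperature)

scores = [1.0, 2.0, 4.0]
sharp = tempered_probs(scores, temperature=0.5)  # [0.0625, 0.25, 1.0]
flat = tempered_probs(scores, temperature=4.0)   # flatter, nearer 1.0
```

Because the probabilities are max-normalised, the top-scoring position always has probability 1, so the training signal never collapses to revealing nothing.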

Experimental Results

The method is empirically evaluated on GSM8K, a standard mathematical reasoning dataset. The core finding is that, in low-step regimes (T = 5), the learned generation schedule achieves 33.1% accuracy with an average of only 4 steps, outperforming the best heuristic baselines (which achieve 23.7–29.0% accuracy at equivalent step budgets). The adaptive scheduling learned by the variational posterior maintains competitive performance in higher-step settings, sometimes matching or slightly trailing logit-margin heuristics as T increases.

These results quantitatively demonstrate that learned generation-order strategies can outperform fixed or simple randomized schedules in highly parallel settings. Notably, the adaptive nature of the approach allows dynamic adjustment of the degree of parallelism, avoiding the errors induced by excessive parallelism documented in recent literature (Kang et al., 6 Oct 2025).

Implications and Future Prospects

Theoretical Implications

The variational inference formulation for generation order brings new theoretical clarity to the token selection problem in MDMs, connecting the learned parallel schedule to latent-variable optimization in sequence denoising diffusion models. The approach formalizes the connection between order selection and sample quality, potentially serving as a foundation for further exploration of deeper factorization strategies, joint optimization landscapes, and information-theoretic bounds on parallel decoding.

Practical Implications

For practitioners deploying MDMs in latency-sensitive environments (e.g., LLM inference or biological sequence generation), this method offers an avenue to increase generation speed without sacrificing accuracy by dynamically balancing parallelism. The architecture’s modular design enables straightforward integration with existing diffusion model training pipelines. Furthermore, it facilitates rapid experimentation with posterior parameterizations, promising improved scalability to more diverse and larger datasets.

Speculative Directions

Future research will likely explore:

  • Scaling the approach to larger models, longer sequences, and more complex token dependencies, leveraging richer neural parameterizations for the approximate posterior.
  • Extending variational order learning to other discrete generative paradigms, such as any-order autoregressive models (Wang et al., 7 Mar 2025), conditional edit flows (Havasi et al., 10 Jun 2025), and speculative decoding frameworks (Guo et al., 29 Apr 2025).
  • Investigating joint learning of both denoising and order models within unified architectures, and evaluating performance across a broader range of downstream tasks with varying dependency structures.

Conclusion

This paper introduces a principled variational inference framework for learning generation orders in masked discrete diffusion models, targeting the optimization of parallel decoding efficiency and sample quality. The approach demonstrates empirical superiority over heuristic baselines in low-step generation regimes and offers a flexible, low-variance training scheme for adaptive token selection. The contributions are poised to support future advances in sequence modeling, parallel decoding, and the broader development of discrete diffusion-based generative architectures (2602.23968).


Explain it Like I'm 14

What is this paper about?

This paper is about teaching a kind of AI model—called a masked discrete diffusion model (MDM)—to decide the best order to fill in missing pieces of text. Think of a sentence with many blanks. The model fills in the blanks over several rounds. If it chooses the right blanks to fill first (and which words to put there), it can finish faster and make fewer mistakes. The authors propose a new, principled way to learn that fill-in order.

What questions are the authors asking?

  • Can we learn a smart plan (order) for which blanks to reveal at each round, instead of using simple rules?
  • Can we keep filling many blanks at once (parallel generation) without causing the model to make more mistakes?
  • Can we train this planning skill in a statistically sound way (using variational inference) so it scales to bigger models and datasets?

How does their method work?

The basic idea of masked diffusion

  • Imagine a scratch-off puzzle: start with a fully covered sentence (all tokens are “masked”), then uncover parts over several rounds until the full sentence appears.
  • An MDM works in reverse: at each round, it picks some positions to unmask and guesses the tokens for them. Doing many at once is fast, but too many at once can cause errors because the guesses may depend on each other.
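
The round-by-round unmasking loop can be illustrated with a toy oracle standing in for the real denoiser (the MASK symbol and the fixed per-round budget are illustrative assumptions):

```python
import random

MASK = "_"

def iterative_unmask(target, tokens_per_round):
    """Toy MDM-style decoding: reveal several masked positions per round.

    The 'denoiser' here is an oracle that copies the true character; a
    real MDM would predict each token from the bidirectional context.
    """
    seq = [MASK] * len(target)
    rounds = 0
    while MASK in seq:
        masked = [i for i, t in enumerate(seq) if t == MASK]
        for i in random.sample(masked, min(tokens_per_round, len(masked))):
            seq[i] = target[i]  # oracle stand-in for the denoiser
        rounds += 1
    return "".join(seq), rounds

out, rounds = iterative_unmask("hello world", tokens_per_round=3)  # 4 rounds
```

Revealing 3 of 11 positions per round finishes in 4 rounds instead of 11; the paper's question is how to choose *which* positions so that quality does not suffer.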

Two jobs, two mini-models

The authors split the job into two parts and train both:

  1. A selector: decides which positions to unmask this round (which blanks to reveal).
  2. A filler (denoiser): predicts the actual token for each chosen position.

This split is important: it lets the system control how much it does in parallel and in what order.

Learning the order with variational inference (VI)

  • VI is a training method that treats hidden choices (like “which positions should we unmask next?”) as latent variables. You build:
    • An “approximate posterior” (think: a teacher plan) that proposes good orders during training.
    • A “prior”/policy (the student selector) that will be used at test time.
  • The training goal (called the ELBO) balances two things:
    • Help the filler confidently predict correct tokens (so it learns to fill well when the selector reveals positions).
    • Make the student selector imitate the teacher’s good orders, so the same kind of orders work at test time.

In everyday terms: during training, a coach (teacher plan) suggests smart reveal orders that make the model’s predictions easy and accurate. At the same time, the player (student selector) practices to match those suggestions. After training, the player can perform well on their own.
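
In symbols, a generic one-latent-variable version of this trade-off (a schematic only; the paper's actual ELBO is defined over the full multi-step trajectory) reads:

```latex
\log p_{\theta}(\mathbf{x}_0)
  \;\ge\;
  \underbrace{\mathbb{E}_{Q_{\phi}(\mathbf{r} \mid \mathbf{x}_0)}
      \big[\log p_{\theta}(\mathbf{x}_0 \mid \mathbf{r})\big]}_{\text{filler predicts well under teacher orders}}
  \;-\;
  \underbrace{\mathrm{KL}\big(Q_{\phi}(\mathbf{r} \mid \mathbf{x}_0) \,\big\|\, P_{\psi}(\mathbf{r})\big)}_{\text{student selector matches teacher}}
```

Maximising the first term trains the filler; minimising the KL term pulls the student selector toward the teacher plan, which is why the same kinds of orders work at test time.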

How the teacher plan proposes orders

  • For each position, a small neural network gives a score (how early it should be revealed).
  • A simple “renormalize and temperature-scale” step turns these scores into probabilities for unmasking at each round.
  • This design:
    • Allows multiple positions to be revealed in the same round (parallelism).
    • Encourages a clear order (higher score → reveal earlier).
    • Guarantees at least one position is revealed each round (no wasted steps).
    • Is efficient: one quick pass gives all scores; small updates handle later rounds.
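
A minimal sketch of this proposal step, assuming positive scores and the max-normalisation described above (names and values here are illustrative, not the paper's exact posterior):

```python
import numpy as np

rng = np.random.default_rng(0)

def propose_reveal(scores, masked, temperature):
    """Turn per-position scores into one round of Bernoulli reveals.

    Dividing each masked position's (positive) score by the current
    maximum gives the top-scoring position probability 1, so at least
    one token is always revealed; temperature controls how many of the
    others join it.
    """
    s = scores[masked]
    probs = (s / s.max()) ** (1.0 / temperature)
    reveal = rng.random(len(masked)) < probs  # independent Bernoulli draws
    return [pos for pos, r in zip(masked, reveal) if r]

scores = np.array([0.2, 0.9, 0.4, 0.7])
chosen = propose_reveal(scores, masked=[1, 2, 3], temperature=1.0)
```

Position 1 has the highest remaining score, so it is always among the chosen reveals; the rest are sampled independently, which is what makes the proposal cheap and parallel.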

Keeping training stable

  • Because the teacher plan also has learned parts, they use a standard trick (REINFORCE with a leave-one-out baseline) to reduce training noise. In simple terms, they average over several tries and subtract the average to make updates steadier.
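
The leave-one-out baseline itself is simple to write down (a generic RLOO sketch, not the paper's training code):

```python
import numpy as np

def rloo_advantages(rewards):
    """REINFORCE leave-one-out: each of the K samples uses the mean
    reward of the other K-1 samples as its baseline, which keeps the
    gradient estimator unbiased while cancelling shared noise."""
    r = np.asarray(rewards, dtype=float)
    k = len(r)
    baselines = (r.sum() - r) / (k - 1)  # mean of the *other* samples
    return r - baselines

adv = rloo_advantages([1.0, 2.0, 3.0])  # [-1.5, 0.0, 1.5]
```

The advantages sum to zero by construction, so only relative quality among the K tries drives the update.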

What did they test and what did they find?

  • Dataset: GSM8K (grade-school math word problems).
  • Base model: a 170M-parameter MDM trained on this task.
  • They compared their learned reveal order against common “rule-based” strategies:
    • IID: reveal a random set of masked positions each round.
    • Top Probability: reveal the positions the model is most confident about.
    • Top Probability Margin: reveal positions where the model’s top guess is much better than its second-best guess.
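
These baselines are easy to sketch given per-position denoiser probabilities (the array shapes and function name here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def select_positions(probs, masked, k, strategy):
    """Toy versions of the rule-based reveal strategies.

    probs: (num_positions, vocab) array of denoiser probabilities.
    """
    if strategy == "iid":
        return list(rng.choice(masked, size=k, replace=False))
    top2 = np.sort(probs[masked], axis=1)[:, -2:]  # two largest per row
    if strategy == "top_prob":
        scores = top2[:, 1]                # confidence of the best guess
    else:  # "top_margin"
        scores = top2[:, 1] - top2[:, 0]   # gap between best and runner-up
    return [masked[i] for i in np.argsort(-scores)[:k]]

probs = np.array([[0.50, 0.49, 0.01],   # confident, but tiny margin
                  [0.45, 0.30, 0.25]])  # less confident, larger margin
by_prob = select_positions(probs, masked=[0, 1], k=1, strategy="top_prob")
by_margin = select_positions(probs, masked=[0, 1], k=1, strategy="top_margin")
```

Note how the two confidence heuristics can disagree on the same input; the learned selector is trained to do better than either fixed rule.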

Key result: with very few rounds (fast decoding), their method was better.

  • With about 4 rounds on average, their method reached 33.1% accuracy, compared to 23.7–29.0% for the baselines at the same cost.
  • With larger budgets (more rounds), the gap shrank. For example, at a 10-step budget, their method was similar to the best baseline; in one setting a baseline slightly edged it at the maximum steps, but their method still beat baselines at the same average steps. This suggests learned ordering helps most when you need to be fast (few rounds), where bad parallel choices hurt quality.

Why does this matter?

  • Faster text generation: Learning a good order lets the model fill in many blanks at once without getting confused, reducing the number of rounds needed.
  • Better quality-speed trade-off: You get higher accuracy for the same speed, or similar accuracy with fewer rounds.
  • A principled foundation: Framing order learning as variational inference is a clean, scalable approach that can extend to larger models and different tasks (like code or biological sequences).
  • Less reliance on shaky confidence scores: Rule-based methods depend heavily on raw model confidence, which can be misleading. A learned selector can do better by considering more global clues.

Key terms in plain language

  • Token: A small piece of text (like a word or subword).
  • Mask: A special “blank” symbol that hides a token.
  • Parallel generation: Filling multiple blanks in the same round.
  • Variational inference (VI): A way to train models with hidden choices by learning an easy-to-sample “teacher” that guides training.
  • ELBO: A score the model tries to maximize during VI; it rewards accurate predictions and penalizes mismatches between teacher and student.
  • KL divergence: A measure of how different two probability distributions are; smaller is better.
  • Bernoulli variable: A coin flip (yes/no) used here to decide if a position is revealed this round.

Final takeaway

The paper shows that learning “which blanks to reveal, and when” can make masked diffusion models both faster and more accurate—especially when you want to finish in just a few rounds. By using a principled training setup, the model learns a reveal plan that works well at test time, bringing us closer to fast, high-quality text generation.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a consolidated list of what remains missing, uncertain, or unexplored in the paper, framed to be concrete and actionable for future researchers.

  • Lack of theoretical guarantees: No analysis of when the proposed ELBO formulation yields optimal or near-optimal unmasking policies, or conditions under which the learned P_ψ(r_t | x_{t+1}) is identifiable and consistent with the true optimal generation order.
  • Independence assumptions on r_t: The posterior and policy treat r_t^n as i.i.d. Bernoulli across positions, leaving unexplored whether modeling dependencies (e.g., structured priors, DPPs, CRFs, autoregressive couplings over indices) could better manage mutual information and reduce over-parallelization errors.
  • Posterior expressiveness: The chosen posterior q_φ^{t,n}(x_{t+1}, x_0) is effectively a static per-sequence schedule based on α(x_0) and mask state, not leveraging step-wise denoiser signals or uncertainty; alternatives that incorporate μ_θ’s confidence or entropy are not studied.
  • Training–inference mismatch: The posterior Q_φ(r_t | x_{t+1}, x_0) uses ground-truth x_0, but inference relies on P_ψ(r_t | x_{t+1}) without x_0; the paper does not quantify how much this mismatch (bridged only via a KL term) hurts downstream performance or how to mitigate it (e.g., annealed KL, consistency regularization).
  • KL weighting strategy: The ELBO includes a KL term between Q_φ and P_ψ, but the paper does not report the presence or tuning of a weighting coefficient, nor ablations showing its effect on stability, convergence, and accuracy.
  • Variance reduction claims: Rao-Blackwellization and RLOO are cited to reduce variance, yet the paper lacks empirical measurements (gradient variance, training stability curves) or theoretical bounds to validate and compare their effectiveness against alternatives (e.g., RELAX, doubly reparameterized estimators).
  • Alternative gradient estimators: The method uses REINFORCE; continuous relaxations (e.g., Gumbel-Softmax for r_t) or hybrid estimators are not investigated as potential routes to lower-variance, faster-converging training.
  • Temperature parameterization: The choice and schedule of the temperature τ in q_φ^{t,n} are heuristic; the paper lacks sensitivity analyses, adaptive temperature schemes, or principled criteria for setting τ to balance training signal and parallelism.
  • Minimum/target unmasking budget per step: The posterior ensures “at least one” token unmasking via Max normalization, but does not control per-step K or adapt K to confidence; exploring constraints or learned K schedules to manage mutual information and step efficiency is left open.
  • Objective design for mutual information: Although related work decomposes error into denoiser error and mutual information among simultaneously generated tokens, the paper does not propose or evaluate explicit MI penalties or entropy-bounded unmasking constraints in the VI objective.
  • Denoiser calibration: The critique of heuristic logit-based confidence is not followed by any calibration analysis of μ_θ; the method does not evaluate whether the learned order improves calibration or whether calibration-aware objectives (e.g., temperature scaling, focal losses) help.
  • Posterior family exploration: The paper notes experimenting with posterior forms but reports only Eq. 11; systematic exploration of alternative Q_φ parameterizations (e.g., conditioning on x_{t+1}, μ_θ signals, or graph-structured priors over indices) is missing.
  • Ablation studies: Key components lack ablations—impact of RLOO sample count, KL term weight, freezing vs. finetuning the denoiser, alpha-network capacity/architecture, noise schedule hyperparameters, and sequence length.
  • Complexity and efficiency: There is no wall-clock evaluation of training/inference overhead introduced by the auxiliary networks and RLOO sampling, nor profiling of memory/latency vs. baseline decoders at matched accuracy or step budgets.
  • Fairness of baseline comparison: Baseline continues finetuning with larger batch diversity and uses linear schedules; the paper does not equalize data exposure or explore stronger baselines (e.g., PC-Sampler, entropy-bounded unmasking, RL-based unmasking policies) to situate its gains among learned competitors.
  • Generalization across tasks and scales: Results are limited to GSM8K with a 170M model; the method’s behavior on longer sequences, diverse domains (code, open-ended text, biological sequences), and larger MDMs remains untested.
  • Robustness to sequence length and vocabulary: The approach’s scalability with large N and vocab sizes is not analyzed; potential degradation due to i.i.d. r_t or variance in training signals for long sequences is unknown.
  • Interaction with diffusion discretization T: The method’s sensitivity to the time discretization, and whether it learns sequential orders when T is small or prefers larger T for parallelism, is not characterized.
  • Adaptive step allocation: The method yields variable step counts per sample, but there is no analysis of how step budget correlates with sample difficulty, nor strategies to learn difficulty-aware budget allocation policies.
  • Learned order interpretability: No qualitative analysis of which positions are chosen early/late, alignment with semantic structure (e.g., numbers, operators, reasoning spans), or consistency across samples.
  • Quantitative uncertainty measures: Beyond accuracy, the paper does not report likelihoods, calibration metrics, confidence intervals, or statistical significance of gains, limiting the robustness of conclusions.
  • Sensitivity to pretraining: The denoiser is pretrained; it is unclear whether the proposed training remains effective from scratch or how pretraining quality modulates gains from learned generation orders.
  • Compatibility with alternative MDM training: The framework is not evaluated under different MDM training regimes (e.g., edit-based flows, stochastic optimal control samplers) to test compatibility and benefits.
  • Theoretical treatment of ELBO decomposition: Some derivations in the appendix contain typographical/notation issues and elide details (e.g., L_0, closed-form steps), which impedes reproducibility and formal verification.
  • Hyperparameter disclosure and reproducibility: Architectural details for α(·) and p_ψ^{t,n}(·), optimizer settings, learning-rate schedules, and code availability are not provided, hindering replication and comparative studies.

Practical Applications

Below are practical, real-world applications that follow directly from the paper’s findings, methods, and innovations on learning parallel generation orders for masked discrete diffusion models (MDMs) via variational inference. Each item names specific use cases, links to sectors, and notes tools/workflows and feasibility assumptions.

Immediate Applications

  • Adaptive parallel decoding plugin for diffusion LLMs (software)
    • Use case: Replace heuristic unmasking strategies (IID/top-prob/top-margin) with the learned unmasking policy to reduce steps and improve accuracy for discrete LLMs.
    • Tool/Product/Workflow: A lightweight “OrderNet” module that plugs into existing MDM inference stacks (e.g., Hugging Face/Transformers-like decoders for diffusion LLMs), providing p_ψ-based position selection; VI training recipe with Rao-Blackwellized ELBO and RLOO gradient control variates.
    • Assumptions/Dependencies: Availability of an MDM denoiser; short fine-tuning stage to learn p_ψ; tasks that don’t require strict left-to-right generation; temperature tuning and batch-size considerations as per the paper.
  • Throughput and latency tuning for enterprise inference (finance, customer support, e-commerce)
    • Use case: Enforce decoding step budgets (e.g., T=5–10) for predictable latency while maintaining accuracy superior to common heuristics in parallel regimes.
    • Tool/Product/Workflow: Inference orchestrator that sets step budgets per request class and uses the learned order to adaptively unmask; latency/accuracy A/B dashboards; parallelization error monitors.
    • Assumptions/Dependencies: Stable denoiser calibration; domain-specific validation; monitoring for error/quality drift.
  • On-device math reasoning assistants (education, daily life)
    • Use case: Faster, more accurate step-limited solutions to math word problems (GSM8K-style) on edge devices with small diffusion LLMs (e.g., ~170M).
    • Tool/Product/Workflow: Mobile/embedded tutor apps using learned generation orders to keep average steps low (≈4–10), enabling offline reasoning; lightweight fine-tuning pipeline.
    • Assumptions/Dependencies: Task similarity to GSM8K; performance generalization from proof-of-concept; memory/compute constraints compatible with MDMs.
  • Structured text completion for forms and reports (healthcare, public sector, insurance)
    • Use case: Fill in masked fields of structured documents (clinical notes, claim forms, incident reports) with parallel decoding to reduce completion time.
    • Tool/Product/Workflow: Form-completion services that integrate MDM decoders with learned position selection; batch inference accelerators for partially masked inputs.
    • Assumptions/Dependencies: Domain adaptation and compliance reviews; guardrails for factuality; datasets with weak left-to-right dependence.
  • Faster code autocompletion and patch generation (software engineering)
    • Use case: Low-latency code completion and small patch generation using MDMs while controlling parallel unmasking to maintain quality.
    • Tool/Product/Workflow: IDE plugins that invoke learned order decoders; CI/CD automation for quick patch suggestions when masking edited regions.
    • Assumptions/Dependencies: Code-domain training; proper calibration of denoiser confidences; evaluation against strong autoregressive baselines.
  • Parallelization risk assessment and benchmarking (academia, industry R&D)
    • Use case: Quantify over-parallelization errors and mutual information issues across tasks; compare learned orders vs heuristic schedules.
    • Tool/Product/Workflow: “Parallelization Bench” extensions that log step budgets, token mutual information, and task metrics; reproducible VI training scripts.
    • Assumptions/Dependencies: Access to task suites (math, code, structured text); stable metrics and comparison protocols.
  • Energy and cost optimization for inference clusters (energy, cloud operations)
    • Use case: Reduce average decoding steps to cut compute time and energy per request; implement green SLAs for LLM services.
    • Tool/Product/Workflow: Cost/energy-aware decoders that apply learned orders; dashboards showing steps vs energy; automated “eco-mode” routing.
    • Assumptions/Dependencies: Realistic measurement pipelines; consistent quality outcomes at lower step counts; compatibility with KV caching and batching strategies.
  • Privacy-preserving on-prem deployments (policy/compliance)
    • Use case: On-prem or edge deployments for sensitive domains where parallel decoding reduces time-to-answer and mitigates throughput constraints.
    • Tool/Product/Workflow: Secure MDM inference stack with learned order selector; on-device training or federated learning for p_ψ.
    • Assumptions/Dependencies: Security certifications; domain fine-tuning; governance over data use; proper validation for safety and reliability.

Long-Term Applications

  • General-purpose Adaptive Parallel Decoder SDK across MDM ecosystems (software)
    • Use case: Unified library supporting multiple tasks (text, code, multimodal discrete tokens) with learned generation orders and flexible step budgets.
    • Tool/Product/Workflow: SDK that abstracts p_ψ and q_φ training, integrates with KV caching, speculative decoding, and position-aware calibration.
    • Assumptions/Dependencies: Broad adoption of diffusion LLMs; robust APIs across frameworks; evidence of gains at larger model scales.
  • Domain-specific clinical summarization and order set generation (healthcare)
    • Use case: High-throughput summarization and structured plan generation with reduced decoding steps while preserving clinical fidelity.
    • Tool/Product/Workflow: “Clinical OrderNet” trained on EHR-like datasets; auditing pipelines with human-in-the-loop review; task-specific calibration.
    • Assumptions/Dependencies: Regulatory compliance (HIPAA/GDPR), rigorous validation, domain-specialized pretraining, safeguards against hallucination.
  • Biological sequence design with order-aware masked diffusion (biotech)
    • Use case: Protein/DNA sequence generation where learned unmasking mitigates mutual-information errors from sampling multiple correlated positions.
    • Tool/Product/Workflow: Sequence-design workbench using VI-learned generation orders; batch sampling with controllable parallelism.
    • Assumptions/Dependencies: Reliable biological fitness proxies; domain evaluation; mapping task structure to discrete diffusion tokens.
  • Real-time planning from discrete action tokens (robotics)
    • Use case: Faster plan synthesis by learning which action tokens to unmask early, balancing parallelism with plan quality.
    • Tool/Product/Workflow: Planning modules that treat actions as discrete tokens and learn order selectors to meet control-loop deadlines.
    • Assumptions/Dependencies: Action-tokenization suitable for diffusion; safety-critical validation; integration with robot control stacks.
  • Hardware–software co-design for energy-aware diffusion LLMs (energy, semiconductor)
    • Use case: Pair learned order decoding with specialized caching/parallel hardware to minimize stalls and energy per token.
    • Tool/Product/Workflow: Co-optimized inference pipelines; schedulers binding step budgets to hardware pathways; profiling tools.
    • Assumptions/Dependencies: Hardware support for diffusion LLM primitives; standardized interfaces; consistent gains across workloads.
  • Standards for reporting parallelization metrics and compute footprints (policy, governance)
    • Use case: Require disclosure of average steps, energy per request, and parallelization error metrics in model cards and service SLAs.
    • Tool/Product/Workflow: Evaluation schemas and regulatory templates; audit tools verifying reported metrics.
    • Assumptions/Dependencies: Community consensus; alignment with emerging AI governance norms; repeatable measurement methodologies.
  • Cross-modal discrete diffusion with learned generation orders (multimodal AI)
    • Use case: Apply the VI framework to tokenized images/audio/video (discrete representations) to control parallel generation quality.
    • Tool/Product/Workflow: Multimodal decoders that learn token positions across modalities; hybrid pipelines combining AR and diffusion components.
    • Assumptions/Dependencies: Robust discrete tokenization for non-text modalities; denoiser architectures that benefit from bi-directional context.
  • RL-enhanced, prompt-adaptive unmasking policies (software, research)
    • Use case: Improve p_ψ by combining VI with reinforcement learning to adapt position selection on a per-prompt basis (hard prompts → less parallelism).
    • Tool/Product/Workflow: Policy training loops with reward shaping (accuracy/latency), curriculum strategies, and safety constraints.
    • Assumptions/Dependencies: Stable RL training in high-variance regimes; careful reward design; compute for exploration.
  • Hybrid speculative decoding for MDMs (software)
    • Use case: Couple learned orders with speculative or any-subset AR techniques to boost speed while controlling quality.
    • Tool/Product/Workflow: Speculative heads that propose unmask sets and are verified by the main denoiser; fallback heuristics.
    • Assumptions/Dependencies: Compatible verification procedures; robust error handling; task-specific tuning.
  • Position-aware calibration and distillation of unmasking policies (academia, industry)
    • Use case: Calibrate p_ψ to reduce decoding bias (e.g., via PC-Sampler-style calibration) and distill learned samplers into smaller models.
    • Tool/Product/Workflow: Calibrator modules that adjust position probabilities; sampler distillation pipelines; continual learning for domain drift.
    • Assumptions/Dependencies: Reliable calibration datasets; transferability across domains; monitoring for degradation over time.

Glossary

  • amortised posterior: A variational posterior parameterized by a shared network that maps inputs to posterior parameters, reused across data points. "using amortised posterior Q_φ"
  • ancestral sampling: A sampling procedure that draws each step from the model’s conditional distributions starting from an initial state. "via ancestral sampling"
  • approximate posterior: A tractable variational distribution used to approximate the intractable true posterior over latent variables. "The approximate posterior used for variational inference has a similar form"
  • Autoregressive Models (ARMs): Models that generate outputs one token at a time, conditioning on previously generated tokens. "their similarity with Autoregressive Models (ARMs)"
  • Bernoulli distribution: A binary probability distribution over {0,1} parameterized by a single success probability. "P_Ψ(r_t^n | x_{t+1}) = Bern(p_ψ^{t,n}(x_{t+1}) · 1{x_{t+1}^n = m})."
  • bi-directional context: Using information from both left and right of a position when predicting tokens. "utilise bi-directional context during generation"
  • categorical distribution: A discrete distribution over a finite set of categories. "q_{t|s}(x_t | x_s) = Cat(x_t; Q̄(s,t)^T x_s)"
  • conditional independence: An assumption that variables are independent given certain other variables, enabling factorized modeling. "we use the following conditional independence assumptions,"
  • control variates: Variance reduction techniques for Monte Carlo estimators that subtract a correlated baseline. "we use REINFORCE-Leave-One-Out (RLOO) control variates"
  • denoiser: The network component that predicts clean tokens (or $x_0$) from masked/noisy inputs. "weight the denoiser cross-entropy loss"
  • Discrete Diffusion Models (DDMs): Diffusion-based generative models defined over discrete state spaces. "Discrete Diffusion Models (DDMs)"
  • ELBO: Evidence Lower Bound; a variational objective that lower-bounds the log likelihood and is optimized instead of it. "We derive the associated ELBO objective"
  • forward Markov process: The noise-adding process in diffusion that evolves data toward a corrupted distribution over time. "we define a forward Markov process"
  • i.i.d. (independent and identically distributed): A collection of random variables that are mutually independent and share the same distribution. "explicitly include i.i.d. binary token selection variables"
  • KL-divergence (Kullback–Leibler divergence): A non-symmetric measure of difference between two probability distributions. "The KL-divergence term encourages QQ to maintain an unmasking schedule"
  • latent variable model: A probabilistic model that includes unobserved variables inferred from observed data. "variational inference of a latent variable model through ELBO optimisation."
  • linear unmasking schedule: A decoding schedule that linearly determines how many tokens to unmask at each step. "we use a linear unmasking schedule to control the number of decoding steps"
  • Markov Decision Process (MDP): A framework with states, actions, transitions, and rewards used to model sequential decision-making. "formulating the generative model as a Markov Decision Process"
  • Masked Diffusion Models (MDMs): A class of discrete diffusion models that iteratively replace mask tokens with predicted tokens. "Masked Diffusion Models (MDMs) in particular have established themselves"
  • mutual information: A measure of dependence quantifying how much knowledge of one variable reduces uncertainty about another. "the mutual information between simultaneously generated token position distributions."
  • one-hot encoding: A vector representation with a single 1 indicating the active category and 0s elsewhere. "the one hot encoding of $x_s$"
  • Rao-Blackwellisation: A variance reduction method that replaces a random variable by its conditional expectation given a sufficient statistic. "through Rao-Blackwellisation."
  • reparameterised discrete diffusion: A reformulation introducing explicit random variables (e.g., selection variables) that preserves the original marginals. "this time discretised model can be reparameterised"
  • REINFORCE: A score-function gradient estimator for stochastic nodes used in variational and policy-gradient methods. "we use REINFORCE to obtain an unbiased estimate of gradients"
  • REINFORCE-Leave-One-Out (RLOO): A REINFORCE variant using leave-one-out baselines to reduce gradient variance. "REINFORCE-Leave-One-Out (RLOO) control variates"
  • temperature scaling: A transformation dividing logits/scores by a temperature to control distribution sharpness and exploration. "the inclusion of the temperature scaling parameter to be beneficial"
  • time discretised model: A continuous-time process approximated by a sequence of discrete time steps. "this time discretised model"
  • Top Probability: A heuristic that unmasks the tokens with the highest predicted probabilities at each step. "Top Probability: At each decoding step,"
  • Top Probability Margin: A heuristic that unmasks tokens with the largest gap between the top-1 and top-2 predicted probabilities. "Top Probability Margin: Similarly to Top Probability, we unmask $K$ masked tokens"
  • variational inference: An optimization-based approach to approximate posterior inference using a parameterized family of distributions. "we propose a variational inference framework"
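Several of the entries above (linear unmasking schedule, Top Probability, Top Probability Margin) describe concrete decoding mechanics; a minimal self-contained sketch, with function names and the list-of-probability-vectors interface chosen for illustration rather than taken from the paper:

```python
def linear_schedule(seq_len, num_steps):
    """Linear unmasking schedule: split seq_len tokens as evenly as
    possible across num_steps decoding steps."""
    base, rem = divmod(seq_len, num_steps)
    return [base + (1 if i < rem else 0) for i in range(num_steps)]

def top_probability(probs, masked, k):
    """Top Probability heuristic: unmask the k masked positions whose
    top-1 predicted probability is highest."""
    scored = [(max(probs[i]), i) for i in masked]
    return [i for _, i in sorted(scored, reverse=True)[:k]]

def top_probability_margin(probs, masked, k):
    """Top Probability Margin heuristic: unmask the k masked positions
    with the largest gap between top-1 and top-2 probabilities."""
    def margin(p):
        s = sorted(p, reverse=True)
        return s[0] - s[1]
    scored = [(margin(probs[i]), i) for i in masked]
    return [i for _, i in sorted(scored, reverse=True)[:k]]
```

In a full decoding loop, the schedule fixes how many tokens `k` to commit at each step, and one of the heuristics (or the learned posterior the paper proposes in their place) chooses which masked positions to unmask.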
