SCUD: Schedule-Conditioned Discrete Diffusion

Updated 30 June 2025
  • SCUD is a discrete diffusion framework that conditions the generative process on explicit noise event schedules in Markov processes.
  • It integrates structured, domain-specific noising processes into the training objective to improve sample quality and likelihood.
  • Empirical results show SCUD outperforms masking and classical diffusion methods in image, language, and protein modeling tasks.

Schedule-Conditioned Discrete Diffusion (SCUD) denotes a class of discrete diffusion models in which the generative process is explicitly conditioned on the schedule—the sequence and timing of noise (corruption) events in the underlying Markov process. SCUD generalizes prior approaches like masking diffusion by incorporating the analytical distribution of jump times (i.e., the event schedule) of discrete Markov processes directly into both the model’s architecture and its training objective. This framework enables the integration of structured, domain-informed noising processes with efficient, analytically grounded generative modeling for categorical data, including images, language, and protein sequences.

1. Mathematical Foundations: Schedules in Discrete Markov Processes

Discrete diffusion models operate by reversing a noising process described by a Markov chain on a categorical state space, with transitions governed by a generator matrix $\mathcal{L}$. Unlike their continuous counterparts, discrete Markov processes evolve by a sequence of abrupt state changes (“jumps”), where waiting times between transitions are exponentially distributed: $\mathbb{P}(\text{jump within }\Delta t \mid \text{in state } b) = 1 - e^{-\mathcal{L}_{b,b}\Delta t}$. The distribution of jump times $S = \{t_1, t_2, \ldots, t_M\}$ for a trajectory is analytically tractable under the forward process $p(S)$.
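
As a concrete illustration, the following minimal NumPy sketch simulates one forward trajectory of such a jump process and records its schedule, under the convention that the off-diagonal entries of the generator are jump rates (so the total leaving rate is the off-diagonal row sum); the function name and interface are illustrative, not taken from the paper.

```python
import numpy as np

def sample_schedule(L, x0, T=1.0, seed=None):
    """Gillespie-style simulation of one forward trajectory and its schedule.

    `L` holds jump rates: the off-diagonal entry L[b, c] is the rate of
    jumping from state b to state c, and the total rate of leaving b is the
    off-diagonal row sum (the rate appearing in the waiting-time formula
    above).  Returns the jump times S = (t_1, ..., t_M) and the states
    visited after each jump.
    """
    rng = np.random.default_rng(seed)
    t, x = 0.0, x0
    jump_times, states = [], []
    while True:
        rates = np.array(L[x], dtype=float)
        rates[x] = 0.0
        total = rates.sum()                          # rate of leaving the current state
        if total == 0.0:                             # absorbing state (e.g. a mask token)
            break
        t += rng.exponential(1.0 / total)            # waiting time ~ Exp(total)
        if t > T:
            break
        x = rng.choice(len(rates), p=rates / total)  # where the jump lands
        jump_times.append(t)
        states.append(x)
    return np.array(jump_times), np.array(states)
```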

SCUD leverages this property by decomposing the generative objective with respect to these transition schedules. The standard evidence lower bound (ELBO) for diffusion models is refined to

$$\mathbb{E}_{p((x_t)_t)} \log \frac{q_\theta\big((x_t)_t \mid x_1, S\big)}{p\big((x_t)_t \mid x_0, x_1, S\big)} - \mathrm{KL}\big(p(S)\,\|\,q_\theta(S)\big) - \mathbb{E}_{p(S, x_0)}\,\mathrm{KL}\big(p(x_1 \mid S, x_0)\,\|\,q_\theta(x_1 \mid S)\big) + C,$$

which separates learning the “where” of transitions (the state changes given a schedule) from the “when” (the timing, encoded in $S$).

2. Mechanism of Schedule Conditioning in SCUD

In SCUD, the backward model is conditioned on the schedule $S$, which is either fixed (matched to the analytic distribution of the forward process) or incorporated as a latent variable. With $q_\theta(S) = p(S)$, the model does not need to learn the time of jumps, only the transitions:

  • At every event $t_m \in S$, the model predicts $q_\theta(x_{t_{m-1}} \mid x_{t_m}, s_{t_m})$, where $s_{t_m}$ is the number of events up to $t_m$.
  • The loss at each event is

$$\mathrm{KL}\left(p(x_{t_{m-1}} \mid x_{t_m}, x_0, s_{t_m}) \,\|\, q_\theta(x_{t_{m-1}} \mid x_{t_m}, s_{t_m})\right)$$

and the full loss sums this over all events, plus regularization ensuring convergence to the stationary distribution.
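
To make the per-event objective concrete, the sketch below computes the forward posterior and the per-event KL in plain NumPy, under the assumption that conditioning on the schedule reduces the forward corruption to powers of a single-jump transition matrix $K$ (one application of $K$ per event); `event_posterior` and `event_kl` are hypothetical helper names, not the paper's API.

```python
import numpy as np

def event_posterior(K, x0, xt, s):
    """Forward posterior p(x_{t_{m-1}} | x_{t_m}=xt, x_0=x0, s_{t_m}=s).

    Assumes conditioning on the schedule reduces the forward corruption to
    powers of a single-jump transition matrix K, so the state after s events
    is distributed as e_{x0} K^s.  By Bayes' rule,
    p(x_{t_{m-1}} = a | x_{t_m}, x_0, s) is proportional to
    (e_{x0} K^{s-1})[a] * K[a, xt].
    """
    prior = np.linalg.matrix_power(K, s - 1)[x0]  # dist. of the state before the event
    post = prior * K[:, xt]                       # times likelihood of jumping a -> xt
    return post / post.sum()

def event_kl(K, x0, xt, s, q):
    """Per-event KL(p(x_{t_{m-1}} | x_{t_m}, x_0, s) || q_theta(x_{t_{m-1}} | x_{t_m}, s))."""
    p = event_posterior(K, x0, xt, s)
    q = np.asarray(q, dtype=float)                # model's predicted distribution over states
    mask = p > 0
    return float(np.sum(p[mask] * (np.log(p[mask]) - np.log(q[mask]))))
```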

Efficient implementations sample a time $t$ uniformly and weight the KL by the instantaneous event rate and the cumulative number of events, enabling large-scale training even with structured noising processes in high-dimensional settings.
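
One plausible reading of this estimator, assuming a constant total jump rate $\beta$ (so the event count by time $t$ is Poisson($\beta t$)) and reusing `event_kl` from the sketch above, is the following Monte Carlo loss; the weighting and interface are illustrative rather than the paper's exact recipe.

```python
import numpy as np

def mc_loss_estimate(K, beta, x0, q_theta, T=1.0, n_samples=64, seed=0):
    """Schematic Monte Carlo estimate of the summed per-event KL.

    Assumes a constant total jump rate `beta`, so the event count by time t is
    Poisson(beta * t); the factor T * beta converts the uniform-time average
    back into a sum over events.  Reuses `event_kl` from the sketch above;
    `q_theta(x_t, s)` stands in for the learned backward model and returns a
    distribution over the previous state.
    """
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_samples):
        t = rng.uniform(0.0, T)
        s = rng.poisson(beta * t) + 1                  # events so far, incl. the one at t
        marginal = np.linalg.matrix_power(K, s)[x0]    # p(x_t | x_0, s events)
        xt = rng.choice(len(marginal), p=marginal / marginal.sum())
        total += T * beta * event_kl(K, x0, xt, s, q_theta(xt, s))
    return total / n_samples
```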

3. Comparative Performance: SCUD, Masking, and Structured Diffusions

On typical image, language, and protein modeling tasks, masking diffusion—where the model conditions on the schedule from a simple, uniform noising process—historically exhibited superior empirical performance compared to classical structured discrete diffusion models. The analysis in SCUD reveals why:

  • Masking diffusion “bakes in” the jump schedule, reducing the generative learning problem to predicting where transitions occur, not when.
  • Prior structured forward processes (e.g., Gaussian intensity transitions for images, BLOSUM matrices for amino acids) performed worse unless combined with schedule conditioning.

Empirically, SCUD outperforms both masking and classical models:

  • On CIFAR-10, models with schedule conditioning and structured forward processes achieve lower negative log-likelihood (bits per dimension) and superior sample quality compared to masking and classical unstructured baselines.
  • On protein modeling and language modeling benchmarks (UniRef50, LM1B), SCUD with domain-informed noising substantially improves perplexity over previous methods, with masking as a special (uniform) case.

A continuum is observed: increasing the amount of schedule information the model conditions on (the $\gamma$ parameterization) smoothly interpolates between classical diffusion ($\gamma \to 0$) and masking ($\gamma \to 1$), with empirical results consistently favoring more schedule information.

4. Incorporation of Inductive Biases and Structured Forward Processes

SCUD enables the use of domain-specific inductive biases in the noising process without sacrificing generative ease or sample quality:

  • Images: A forward process based on a Gaussian kernel over intensity values, with higher-probability transitions to similar pixel values (a construction sketch follows this list). SCUD exploits the analytical schedule to yield better likelihood and more realistic image samples than masking or classical PixelCNN-type models.
  • Proteins: Use of empirically derived BLOSUM substitution matrices enables biologically plausible forward processes on discrete sequence data. With SCUD, these matrices yield lower perplexity and better modeling of protein language than uniform or masking approaches.
  • Language: Construction of structured nearest-neighbor graphs for frequent vocabulary tokens, with SCUD allowing tractable computation and improved perplexity, even with large vocabularies.
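
As a concrete illustration of the image case above, the sketch below builds an intensity-structured generator from a Gaussian kernel over pixel values; the kernel width `sigma` and rate `beta` are arbitrary illustrative choices, not values from the paper.

```python
import numpy as np

def gaussian_intensity_generator(n_values=256, sigma=5.0, beta=1.0):
    """Generator over pixel intensities with Gaussian-kernel jump rates.

    Off-diagonal rates decay with the squared intensity distance, so jumps
    preferentially land on nearby pixel values.  Each row of the off-diagonal
    part is normalised so that every state leaves at the same total rate
    `beta`, which keeps the jump schedule simple; `sigma` and `beta` are
    illustrative choices, not values from the paper.
    """
    v = np.arange(n_values)
    rates = np.exp(-0.5 * ((v[:, None] - v[None, :]) / sigma) ** 2)
    np.fill_diagonal(rates, 0.0)
    rates = beta * rates / rates.sum(axis=1, keepdims=True)  # total leave rate = beta
    return rates - beta * np.eye(n_values)                   # generator rows sum to zero
```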

Prior models could not efficiently exponentiate large structured generators to compute the forward process for complex alphabets. SCUD’s analytic schedule conditioning, by focusing learning on state transitions for a given event pattern, enables efficient implementation for sparse, structured $\mathcal{L}$, making such processes practical at scale.
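
The computational point can be made concrete: once the event count is known, the forward marginal is a power of the single-jump matrix rather than a matrix exponential of $\mathcal{L}$, so a sparse neighbor-structured kernel only ever requires cheap sparse matrix-vector products. The sketch below uses a stand-in neighbor graph, not the paper's construction.

```python
import numpy as np
from scipy.sparse import csr_matrix

def corrupt_given_events(K, x0, s):
    """p(x_t | x_0, s events): s sparse mat-vecs instead of expm(t * L).

    `K` is the (sparse) single-jump transition matrix.  Conditioning on the
    event count s means the forward marginal is the x0-th row of K^s, which
    is computed here by repeated sparse multiplication.
    """
    probs = np.zeros(K.shape[0])
    probs[x0] = 1.0
    for _ in range(s):
        probs = K.T @ probs                   # advance one jump event
    return probs

# Example: a stand-in nearest-neighbour jump matrix over a large vocabulary.
V, k = 50_000, 4
rows = np.repeat(np.arange(V), k)
cols = (rows + np.tile(np.arange(1, k + 1), V)) % V
K = csr_matrix((np.full(V * k, 1.0 / k), (rows, cols)), shape=(V, V))
print(corrupt_given_events(K, x0=123, s=3).sum())  # ~ 1.0 (probability is conserved)
```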

5. Theoretical Clarification: SCUD as a Generalization

Schedule-conditioned discrete diffusion generalizes and subsumes both classical discrete diffusion and masking:

  • Classical discrete diffusion: model must learn both when and where to jump; learning the schedule is hard and suboptimal.
  • Masking diffusion: model is told exactly when to jump (schedule is fully known and uniform); only required to learn the destination.
  • SCUD: allows arbitrary schedule conditioning, enabling models to leverage any forward process (with or without inductive bias), simply by matching the forward and backward event distributions.

Over-conditioning—for example, on all fine-grained details of mutation events—can complicate the denoising task or hinder convergence to the stationary distribution, a potential direction for further investigation.

6. Practical Implementation and Extensions

Key practical features demonstrated in the paper include:

  • Efficient training is possible by leveraging the factored structure of schedules in high-dimensional data (e.g., each pixel or sequence position has an independent schedule, permitting parallel learning and sampling); a sketch of this factorization follows the list.
  • The analytic form of the jump schedule allows complex, sparse, or biologically-inspired forward processes to be used even in large state spaces.
  • SCUD is compatible with improved sampling algorithms and flow-matching approaches (see Appendix in the paper), suggesting further potential for quality and efficiency improvements.
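
A minimal sketch of that factorized corruption step, assuming a constant per-position jump rate `beta` and a precomputed table of powers of a single-jump matrix `K` (the function name and `max_events` cap are illustrative):

```python
import numpy as np

def corrupt_batch(K, x0, beta, t, max_events=64, seed=None):
    """Corrupt every position of x0 in parallel using independent schedules.

    Assumes a constant per-position jump rate `beta`, so each position's event
    count by time t is an independent Poisson(beta * t) draw; `max_events` is
    an illustrative cap on the precomputed powers of the single-jump matrix K.
    All positions are corrupted at once, with no sequential loop over time.
    """
    rng = np.random.default_rng(seed)
    counts = np.minimum(rng.poisson(beta * t, size=x0.shape), max_events)
    K_pows = np.stack([np.linalg.matrix_power(K, s) for s in range(max_events + 1)])
    probs = K_pows[counts, x0]                    # per-position dist. over corrupted states
    cum = probs.cumsum(axis=-1)
    u = rng.random(x0.shape + (1,))
    xt = (u < cum).argmax(axis=-1)                # one categorical draw per position
    return xt, counts
```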

Sample-quality improvements under SCUD are observed not only in likelihoods but also in qualitative visual assessments of generated samples.


Summary Table: SCUD vs. Prior Approaches

| Aspect | Classical Discrete Diffusion | Masking Diffusion | SCUD (Schedule-Conditioned) |
|---|---|---|---|
| Schedule information in model | None (“when” not known) | Full (simple, uniform) | General, analytic (arbitrary noising process) |
| Suitability for structured processes | Difficult, often suboptimal | Generally not possible | Efficient and empirically superior |
| Learning task | “When” and “where” to jump | Only “where” | Only “where” (with known schedule) |
| Empirical sample quality | Inferior | High | Highest (when inductive bias present) |
| Theoretical status | Subsumed | Special case of SCUD | Most general case |

7. Future Directions and Open Questions

Potential research avenues include:

  • Optimizing the amount and type of schedule information conditioned on in the backward model to balance inductive bias and sample quality.
  • Integrating SCUD with advanced sampling strategies and flow-matching for further performance gains.
  • Extending SCUD to schedules learned or adapted dynamically in response to external signals or data complexity.

In summary, the SCUD framework provides a theoretically principled and practically efficient method for discrete diffusion modeling, delivering strong generative performance particularly when structured noising processes capture domain-specific inductive biases. By focusing the learning problem on transition “destinations” and incorporating schedule information analytically, SCUD generalizes, explains, and surpasses previous discrete diffusion approaches, including masking diffusion.