
The Cosine Schedule is Fisher-Rao-Optimal for Masked Discrete Diffusion Models (2508.04884v1)

Published 6 Aug 2025 in stat.ML and cs.LG

Abstract: In this work, we study the problem of choosing the discretisation schedule for sampling from masked discrete diffusion models in terms of the information geometry of the induced probability path. Specifically, we show that the optimal schedule under the Fisher-Rao geometry recovers the popularly-used cosine schedule.

Summary

  • The paper demonstrates that the cosine schedule is Fisher-Rao-optimal by deriving it as the geodesic of the Fisher-Rao metric for masked discrete diffusion.
  • It leverages a closed-form computation of the Fisher-Rao metric to show that uniform discretization under this metric leads to an optimal cosine schedule.
  • The findings provide a theoretical justification for the empirical success of the cosine schedule while suggesting improved design for discrete diffusion model samplers.

Fisher-Rao Optimality of the Cosine Schedule in Masked Discrete Diffusion Models

Introduction

This paper addresses the problem of discretization schedule selection in masked discrete diffusion models, focusing on the information geometry of the induced probability path. The central result is that the widely used cosine schedule is not merely a heuristic but is in fact Fisher-Rao-optimal for masked discrete diffusion. The analysis leverages the closed-form computation of the Fisher-Rao metric for the probability path induced by the forward noising process, and derives the optimal schedule as the geodesic under this metric. This provides a theoretical justification for the empirical success of the cosine schedule in discrete diffusion models.

Background and Theoretical Framework

Masked Discrete Diffusion

The masked discrete diffusion process is defined over sequences of discrete tokens, where each token can be replaced by a special "masked" state. The forward process is a continuous-time Markov chain (CTMC) parameterized by a time-dependent masking rate $\beta(t)$, leading to a conditional distribution at time $t$ given by

$$q(x_t|x_0) = \prod_{n=1}^N q(x_t^{(n)}|x_0^{(n)}), \quad q(x_t^{(n)}|x_0^{(n)}) = \text{Cat}(x_t^{(n)}; \bar{Q}(t)^\top x_0^{(n)}),$$

where $\bar{Q}(t) = \alpha_t I + (1-\alpha_t) \mathbf{1} e_m^\top$, with $\mathbf{1}$ the all-ones vector and $e_m$ the one-hot vector of the mask state, and $\alpha_t = \exp\left(-\int_0^t \beta(s) \, ds\right)$. The process is factorized across the $N$ tokens, and the probability of a token being masked at time $t$ is $1-\alpha_t$.
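As a minimal sketch of this forward process, assuming tokens are stored as non-negative integer ids and using a hypothetical sentinel `MASK_ID` for the mask state (neither is specified in the summary), each token is kept with probability $\alpha_t$ and masked otherwise:

```python
import numpy as np

MASK_ID = -1  # hypothetical id for the mask state; real vocabularies differ


def forward_mask(x0, alpha_t, rng):
    """Sample x_t ~ q(x_t | x_0): keep each token w.p. alpha_t, else mask it."""
    x0 = np.asarray(x0)
    keep = rng.random(x0.shape) < alpha_t  # independent per token (factorized)
    return np.where(keep, x0, MASK_ID)


rng = np.random.default_rng(0)
x0 = rng.integers(0, 10, size=100_000)  # toy "sequence" of token ids in 0..9
xt = forward_mask(x0, alpha_t=0.7, rng=rng)
masked_fraction = np.mean(xt == MASK_ID)  # should be close to 1 - alpha_t
```

The empirical masked fraction approximates $1-\alpha_t$, and unmasked positions are left unchanged, which is all that the matrix $\bar{Q}(t)$ encodes.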

Information Geometry and the Fisher-Rao Metric

The family of marginal distributions $\{q_t\}$ forms a 1D manifold, and the Fisher-Rao metric provides a natural Riemannian structure. For a path $q_t$, the Fisher-Rao metric at time $t$ is

$$I(t) = \text{Var}_{x_t \sim q_t} \left( \partial_t \log q_t(x_t) \right).$$

The length of a path under a metric $\delta(t)$ is

$$\Lambda = \int_0^1 \sqrt{\delta(\varphi(t))} \, \dot{\varphi}(t) \, dt,$$

where $\varphi$ is a reparameterization of time. The optimal schedule is the geodesic under this metric, traversing the path at constant speed.

Optimal Discretization Schedules

The optimal schedule is derived by minimizing the energy functional associated with the Fisher-Rao metric. Writing $\Lambda(s)$ for the length accumulated up to time $s$ and $\Lambda = \Lambda(1)$ for the total length, the geodesic $\varphi^*$ satisfies

$$\varphi^*(t) = \Lambda^{-1}(\Lambda t), \quad \Lambda(s) = \int_0^s \sqrt{\delta(r)} \, dr.$$

This ensures that the "distance" (as measured by the Fisher-Rao metric) between consecutive discretization points is uniform, optimizing the allocation of computational resources.
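For a general metric, the inversion $\Lambda^{-1}$ can be approximated numerically. The following is a rough sketch, not the paper's procedure: the function name `equal_arclength_schedule`, the trapezoidal integration, and the toy metric are all assumptions made for illustration.

```python
import numpy as np


def equal_arclength_schedule(delta, num_steps, grid_size=10_001):
    """Return times t_0..t_T in [0, 1] equally spaced in arc length under delta."""
    r = np.linspace(0.0, 1.0, grid_size)
    speed = np.sqrt(delta(r))
    # Cumulative trapezoidal rule for Lambda(s) = int_0^s sqrt(delta(r)) dr.
    increments = 0.5 * (speed[1:] + speed[:-1]) * np.diff(r)
    lam = np.concatenate(([0.0], np.cumsum(increments)))
    # Equal targets in length, mapped back to time by inverting Lambda
    # via linear interpolation (assumes Lambda is strictly increasing).
    targets = np.linspace(0.0, lam[-1], num_steps + 1)
    return np.interp(targets, lam, r)


# Toy metric (an assumption): delta(r) = 4 r^2 gives Lambda(s) = s^2,
# so the constant-speed schedule is t_i = sqrt(i / T).
toy_delta = lambda r: 4.0 * r**2
ts = equal_arclength_schedule(toy_delta, num_steps=4)
```

With a constant metric the same function returns a uniform grid, which matches the intuition that reparameterization only matters when the path's speed varies over time.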

Main Result: Cosine Schedule as Fisher-Rao Geodesic

The main theorem establishes that, for masked discrete diffusion, the Fisher-Rao metric can be computed in closed form:

$$I(t) = \frac{N \dot{\alpha}_t^2}{\alpha_t (1-\alpha_t)}.$$
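One way to see where this closed form comes from is sketched below. It is a reconstruction from the factorized masking model, not necessarily the paper's own proof; the symbols $K$ (the number of masked tokens in $x_t$) and $q_0$ (the data distribution, marginalized onto the unmasked positions) are notation introduced here.

```latex
% K = number of masked tokens in x_t, so K ~ Binomial(N, 1 - \alpha_t).
% The factor q_0(.) does not depend on t and drops out of the time derivative.
\begin{align*}
q_t(x_t) &= \alpha_t^{N-K} (1-\alpha_t)^{K} \, q_0\big(x_t^{\text{unmasked}}\big), \\
\partial_t \log q_t(x_t)
  &= \dot{\alpha}_t \left( \frac{N-K}{\alpha_t} - \frac{K}{1-\alpha_t} \right)
   = \dot{\alpha}_t \, \frac{N(1-\alpha_t) - K}{\alpha_t (1-\alpha_t)}, \\
I(t) &= \frac{\dot{\alpha}_t^2 \, \mathrm{Var}(K)}{\alpha_t^2 (1-\alpha_t)^2}
   = \frac{\dot{\alpha}_t^2 \, N \alpha_t (1-\alpha_t)}{\alpha_t^2 (1-\alpha_t)^2}
   = \frac{N \dot{\alpha}_t^2}{\alpha_t (1-\alpha_t)}.
\end{align*}
```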

Integrating this metric yields

$$\Lambda(t) = 2\sqrt{N} \left( \frac{\pi}{2} - \arcsin \sqrt{\alpha_t} \right).$$
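The integration step can be worked out with a standard substitution. This sketch assumes $\alpha_0 = 1$ and that $\alpha_t$ is non-increasing, both of which follow from $\alpha_t = \exp(-\int_0^t \beta(s)\,ds)$ with $\beta \geq 0$; the substitution variable $\theta$ is notation introduced here.

```latex
% Since \alpha_t is non-increasing, |\dot{\alpha}_s| = -\dot{\alpha}_s.
\begin{align*}
\Lambda(t) &= \int_0^t \sqrt{I(s)} \, ds
  = \sqrt{N} \int_0^t \frac{-\dot{\alpha}_s}{\sqrt{\alpha_s (1-\alpha_s)}} \, ds
  = \sqrt{N} \int_{\alpha_t}^{1} \frac{du}{\sqrt{u(1-u)}} \\
&= \sqrt{N} \int_{\arcsin\sqrt{\alpha_t}}^{\pi/2}
   \frac{2 \sin\theta \cos\theta}{\sin\theta \cos\theta} \, d\theta
   \qquad (u = \sin^2\theta) \\
&= 2\sqrt{N} \left( \frac{\pi}{2} - \arcsin\sqrt{\alpha_t} \right).
\end{align*}
```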

Solving for the geodesic, the optimal schedule is given by

$$\alpha_{t_i^*} = \cos^2 \left( \frac{i}{T} \left( \frac{\pi}{2} - \arcsin \sqrt{\alpha_1} \right) \right),$$

where $T$ is the number of discretization steps and $i = 0, \dots, T$. For $\alpha_1 = 0$ (full masking at $t=1$), this reduces to the standard cosine schedule:

$$\alpha_{t_i^*} = \cos^2 \left( \frac{i}{T} \frac{\pi}{2} \right).$$
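The schedule is simple to compute directly. The sketch below uses hypothetical helper names (`cosine_alphas`, `fisher_rao_length`) and checks numerically that consecutive points are equally spaced in Fisher-Rao length, using the closed form for $\Lambda$ given above.

```python
import numpy as np


def cosine_alphas(num_steps, alpha_1=0.0):
    """alpha at the T + 1 schedule points, from alpha = 1 down to alpha_1."""
    i = np.arange(num_steps + 1)
    total_angle = np.pi / 2 - np.arcsin(np.sqrt(alpha_1))
    return np.cos(i / num_steps * total_angle) ** 2


def fisher_rao_length(alpha, seq_len=1):
    """Closed-form length: 2 sqrt(N) (pi/2 - arcsin sqrt(alpha))."""
    return 2.0 * np.sqrt(seq_len) * (np.pi / 2 - np.arcsin(np.sqrt(alpha)))


alphas = cosine_alphas(num_steps=8)          # alpha_1 = 0: standard cosine schedule
gaps = np.diff(fisher_rao_length(alphas))    # length covered by each step
```

Every entry of `gaps` equals the total length divided by the number of steps, which is the constant-speed property that characterizes the geodesic.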

This result rigorously justifies the use of the cosine schedule, previously introduced heuristically, as Fisher-Rao-optimal for masked discrete diffusion.

Implications and Limitations

The theoretical result provides a principled foundation for the cosine schedule in discrete diffusion models, aligning the discretization steps with the information geometry of the probability path. This has practical implications for the design of discrete diffusion samplers, suggesting that the cosine schedule is not only empirically effective but also optimal in a precise geometric sense.

However, the analysis is limited to the true probability path $q_t$ and does not account for errors introduced by model approximation or the specifics of the sampling scheme. The optimality is established under the Fisher-Rao metric; alternative metrics may yield different schedules. Empirical validation of the theoretical optimality, as well as extensions to other forms of discrete corruption processes or alternative metrics, remain open directions.

Future Directions

Potential avenues for further research include:

  • Empirical evaluation of the Fisher-Rao-optimal schedule in practical masked discrete diffusion models, particularly in the presence of model and sampling errors.
  • Extension of the information-geometric analysis to other discrete corruption processes, such as uniform discrete diffusion.
  • Investigation of optimal schedules under alternative metrics, such as Wasserstein or pathwise KL.
  • Analysis of the impact of schedule choice on convergence rates and sample quality in high-dimensional discrete generative modeling.

Conclusion

This work establishes that the cosine schedule is Fisher-Rao-optimal for masked discrete diffusion models, providing a theoretical justification for its widespread use. The result bridges the gap between heuristic practice and information geometry, and motivates further exploration of geometric principles in the design and analysis of discrete generative models.

Authors (1)
