Optimal form of the MDM-Prime subtokenizer

Determine the optimal functional form of the invertible subtokenizer f_ℓ: X → Y^ℓ in MDM-Prime that minimizes the variational bound on negative log-likelihood for a given token granularity ℓ and vocabulary size V, rather than relying on standard base-b encoding. Precisely characterize or construct an invertible mapping that achieves this optimum under the masked diffusion process used by MDM-Prime.

Background

MDM-Prime extends masked diffusion LLMs by introducing a subtokenizer f_ℓ that maps each token index into a sequence of ℓ sub-tokens, enabling partial masking at the sub-token level. The paper establishes that increasing ℓ tightens the variational bound and recommends binary encoding (ℓ = ⌈log₂ V⌉), but the specific functional form of f_ℓ also critically affects likelihood estimation.
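For reference, the standard base-b encoding that the problem takes as its baseline can be sketched as follows. This is a minimal illustration, not the paper's implementation: the names `binary_granularity` and `make_base_b_subtokenizer`, the most-significant-digit-first ordering, and the choice of the smallest base b with b^ℓ ≥ V are all assumptions made here.

```python
def binary_granularity(vocab_size: int) -> int:
    """Return ceil(log2(V)) using integer arithmetic (assumes V > 1)."""
    return (vocab_size - 1).bit_length()


def make_base_b_subtokenizer(vocab_size: int, ell: int):
    """Build a base-b encoder/decoder pair mapping token ids to ell digits.

    Hypothetical helper: picks the smallest base b such that b**ell >= V,
    so every token id in [0, V) gets a unique length-ell digit tuple.
    """
    b = 2
    while b ** ell < vocab_size:
        b += 1

    def encode(token_id: int) -> tuple:
        if not 0 <= token_id < vocab_size:
            raise ValueError("token id out of range")
        digits = []
        for _ in range(ell):
            digits.append(token_id % b)  # least significant digit first
            token_id //= b
        return tuple(reversed(digits))   # most significant digit first

    def decode(sub_tokens: tuple) -> int:
        token_id = 0
        for digit in sub_tokens:
            token_id = token_id * b + digit
        return token_id

    return b, encode, decode
```

When b^ℓ exceeds V, some digit tuples correspond to no token, so the map is invertible only on its image. Any alternative f_ℓ in the sense of this problem would replace `encode` and `decode` while keeping the round trip `decode(encode(i)) == i`.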

The authors analyze how f_ℓ influences the entropy of the latent variables and show that higher sub-token entropy tightens the variational bound, motivating an index shuffling heuristic composed with base-b encoding. However, they explicitly state that the precise form of f_ℓ that minimizes the variational bound remains open and provide index shuffling as an approximate, practical solution.
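The effect of shuffling on sub-token entropy can be illustrated with a small numerical sketch. Everything specific here is an assumption for illustration: the Zipf-like token distribution, the uniformly random permutation standing in for the shuffle, the use of per-position marginal entropy as the measured quantity, and the function names. The paper's exact shuffling procedure and entropy measure may differ.

```python
import math
import random


def zipf_probs(vocab_size: int) -> list:
    """Toy token distribution with p_i proportional to 1 / (i + 1)."""
    weights = [1.0 / (i + 1) for i in range(vocab_size)]
    total = sum(weights)
    return [w / total for w in weights]


def sub_token_entropies(probs, ell, b, perm=None):
    """Marginal entropy (in bits) of each of the ell base-b digit positions.

    perm, if given, is a permutation of range(len(probs)) applied to the
    token index before base-b encoding (a stand-in for index shuffling).
    """
    marginals = [[0.0] * b for _ in range(ell)]
    for token_id, p in enumerate(probs):
        idx = perm[token_id] if perm is not None else token_id
        for pos in reversed(range(ell)):  # fill least significant digit first
            marginals[pos][idx % b] += p
            idx //= b
    return [-sum(q * math.log2(q) for q in row if q > 0) for row in marginals]


def compare_shuffling(num_bits: int = 10, seed: int = 0):
    """Return (plain, shuffled) per-position entropies for V = 2**num_bits."""
    vocab_size = 2 ** num_bits
    probs = zipf_probs(vocab_size)
    perm = list(range(vocab_size))
    random.Random(seed).shuffle(perm)  # assumed: uniformly random shuffle
    plain = sub_token_entropies(probs, num_bits, 2)
    shuffled = sub_token_entropies(probs, num_bits, 2, perm)
    return plain, shuffled
```

Calling `compare_shuffling()` and comparing `sum(plain)` with `sum(shuffled)` shows the tendency the authors describe: when frequent tokens cluster at low indices, the high-order digits are nearly deterministic, and shuffling spreads probability mass more evenly across digit values. Because a permutation is a bijection, the entropy of the token itself is unchanged; only how it is distributed over sub-token positions changes, which is why the choice of f_ℓ matters.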

References

While Section~\ref{sec:methodology:likelihood} identifies the optimal value of $\ell$ for a given invertible subtokenizer $f_\ell$, the form of $f_\ell$ remains an open question.

MDM-Prime-v2: Binary Encoding and Index Shuffling Enable Compute-optimal Scaling of Diffusion Language Models (2603.16077 - Chao et al., 17 Mar 2026) in Section 3.2, Increasing Sub-token Entropy via Index Shuffling