Optimal form of the MDM-Prime subtokenizer
Determine the optimal functional form of the invertible subtokenizer f_ℓ: X → Y^ℓ in MDM-Prime that minimizes the variational bound on negative log-likelihood for a given token granularity ℓ and vocabulary size V, rather than relying on standard base-b encoding. Precisely characterize or construct an invertible mapping that achieves this optimum under the masked diffusion process used by MDM-Prime.
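For concreteness, the baseline that the problem statement contrasts against can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes token indices in {0, ..., V-1}, takes b as the smallest base with b^ℓ ≥ V, and models "index shuffling" as a seeded random permutation of token indices applied before base-b encoding (one reading of the cited section title). All function names are hypothetical. The sketch only shows that such maps are invertible; it says nothing about which one minimizes the variational bound.

```python
import random


def smallest_base(V, ell):
    """Smallest integer b with b**ell >= V (assumed choice of base)."""
    b = 1
    while b ** ell < V:
        b += 1
    return b


def base_b_encode(x, b, ell):
    """Map a token index x to ell base-b digits, most significant first."""
    digits = []
    for _ in range(ell):
        digits.append(x % b)
        x //= b
    return tuple(reversed(digits))


def base_b_decode(digits, b):
    """Inverse of base_b_encode."""
    x = 0
    for d in digits:
        x = x * b + d
    return x


def make_shuffled_subtokenizer(V, ell, seed=0):
    """Build (f, f_inv, b): a permutation of indices followed by base-b encoding.

    The permutation is an illustrative assumption; any bijection on
    {0, ..., V-1} composed with base-b encoding stays invertible.
    """
    b = smallest_base(V, ell)
    perm = list(range(V))
    random.Random(seed).shuffle(perm)
    inv = [0] * V
    for i, p in enumerate(perm):
        inv[p] = i

    def f(x):
        return base_b_encode(perm[x], b, ell)

    def f_inv(y):
        return inv[base_b_decode(y, b)]

    return f, f_inv, b


if __name__ == "__main__":
    V, ell = 10, 2
    f, f_inv, b = make_shuffled_subtokenizer(V, ell, seed=0)
    # Every token must round-trip through the sub-token sequence.
    assert all(f_inv(f(x)) == x for x in range(V))
    print("base:", b, "example:", f(3))
```

Any other bijection on the index set could be substituted for the random permutation, which is why the space of candidate subtokenizers is large and the optimal form is non-obvious.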
References
While Section~\ref{sec:methodology:likelihood} identifies the optimal value of $\ell$ for a given invertible subtokenizer $f_\ell$, the form of $f_\ell$ remains an open question.
— MDM-Prime-v2: Binary Encoding and Index Shuffling Enable Compute-optimal Scaling of Diffusion Language Models
(2603.16077 - Chao et al., 17 Mar 2026) in Section 3.2, Increasing Sub-token Entropy via Index Shuffling