Performance of non-causal patch boundaries when training from scratch

Determine how non-causal patch boundary prediction in Latent Tokenizer Language Models (such as the Bolmo architecture) performs when models are trained from scratch, and establish whether the boundary predictor's increased expressivity is generally beneficial, relative to causal boundary prediction, on standard language modeling and downstream tasks.

Background

Bolmo introduces a non-causal boundary predictor that uses one byte of future context, which is central to its byteification strategy and helps match the expressivity of subword tokenizers during prefill. The paper focuses exclusively on byteifying existing subword-level LMs rather than training byte-level models from scratch.
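
To make the causal/non-causal distinction concrete, below is a minimal sketch of a boundary predictor whose decision at byte position t can optionally condition on one byte of future context. This is an illustrative PyTorch implementation under stated assumptions, not Bolmo's actual code: the class name `BoundaryPredictor`, the `lookahead` parameter, and the way future-byte embeddings are mixed in are all hypothetical.

```python
# Sketch (hypothetical, not Bolmo's implementation) of causal vs.
# non-causal boundary prediction over a byte sequence. With lookahead=1,
# the boundary logit at position t also sees byte t+1 (non-causal);
# with lookahead=0 it is strictly causal.
import torch
import torch.nn as nn


class BoundaryPredictor(nn.Module):
    """Predicts, for each byte position t, the probability that a patch
    boundary falls after t."""

    def __init__(self, hidden_dim: int = 256, lookahead: int = 1):
        super().__init__()
        self.lookahead = lookahead
        self.byte_emb = nn.Embedding(256, hidden_dim)
        # The logit is computed from the current byte embedding plus
        # (optionally) the embeddings of `lookahead` future bytes.
        self.proj = nn.Linear(hidden_dim * (1 + lookahead), 1)

    def forward(self, bytes_: torch.Tensor) -> torch.Tensor:
        # bytes_: (batch, seq_len) of byte values in [0, 256)
        h = self.byte_emb(bytes_)                      # (B, T, D)
        feats = [h]
        for k in range(1, self.lookahead + 1):
            # Shift left by k so position t sees byte t+k; zero the
            # final positions, which have no future context available.
            fut = torch.roll(h, shifts=-k, dims=1)
            fut[:, -k:, :] = 0.0
            feats.append(fut)
        logits = self.proj(torch.cat(feats, dim=-1)).squeeze(-1)
        return torch.sigmoid(logits)                   # boundary probs


if __name__ == "__main__":
    x = torch.randint(0, 256, (2, 16))
    causal = BoundaryPredictor(lookahead=0)
    non_causal = BoundaryPredictor(lookahead=1)
    print(causal(x).shape, non_causal(x).shape)  # both (2, 16)
```

The one-byte lookahead is what makes the predictor non-causal: the boundary decision at position t depends on byte t+1, which during prefill is already available, consistent with the expressivity argument above.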

The authors note they have not assessed how non-causal boundaries behave under training-from-scratch regimes and explicitly state uncertainty about whether the increased expressivity will translate into general benefits outside the byteification setting.

References

"For example, we have not assessed how non-causal patch boundaries perform when training from scratch. We expect that the increased expressivity of the boundary predictor might be generally useful, but we do not yet know."

— Bolmo: Byteifying the Next Generation of Language Models (arXiv:2512.15586, Minixhofer et al., 17 Dec 2025), Section 7 (Future Directions)