- The paper shows that enforcing canonicality can only reduce the KL divergence between the estimated and true distributions, yielding a provably closer fit.
- It introduces two methods: conditioning at test time and integrating canonical constraints into the model architecture.
- Empirical evaluations demonstrate reduced log-loss on datasets such as Penn Treebank and WikiText, with local approximations offering practical efficiency.
LLMs over Canonical Byte-Pair Encodings
The paper "LLMs over Canonical Byte-Pair Encodings" presents a comprehensive study on enforcing canonicality in token-level LLMs and the implications of such enforcement in the probability distribution over character strings. The study focuses on addressing the issues of noncanonical token encodings, which modern LLMs, such as those employing byte-pair encoding (BPE), erroneously and wastefully assign positive probability mass to—diverting resources away from plausible outputs.
Overview
LLMs typically rely on tokenization to represent text as sequences of tokens that are shorter than the original character strings. Byte-pair encoding (BPE) is a prevalent technique that facilitates large-scale language modeling by compressing strings into token sequences via an ordered list of merge rules learned from a corpus. Because this procedure is deterministic, every string has exactly one canonical encoding; yet the token vocabulary admits many alternative encodings of the same string, and a token-level LM assigns such noncanonical sequences nonzero probability mass even though they never occur in the tokenized training data.
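To make the misallocation concrete, here is a toy BPE encoder. The merge table and four-token vocabulary are hypothetical, invented for this sketch rather than taken from the paper:

```python
# Toy BPE: merge rules are applied in priority order (lowest rank first),
# so every character string has exactly one canonical tokenization.
# The merge table below is hypothetical, for illustration only.
RANK = {("a", "b"): 0, ("ab", "ab"): 1}

def bpe_encode(text: str) -> list[str]:
    """Deterministically encode `text`: repeatedly merge the adjacent
    token pair with the highest-priority (lowest-rank) merge rule."""
    tokens = list(text)
    while True:
        candidates = [(RANK[pair], i)
                      for i, pair in enumerate(zip(tokens, tokens[1:]))
                      if pair in RANK]
        if not candidates:
            return tokens
        _, i = min(candidates)            # best-ranked pair, leftmost first
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]

print(bpe_encode("abab"))  # ['abab'] -- the unique canonical encoding
# The vocabulary {'a', 'b', 'ab', 'abab'} nevertheless admits other
# encodings of the same string, e.g. ['ab', 'a', 'b'] or ['a', 'b', 'ab'].
# A token-level LM can assign these noncanonical sequences positive
# probability, even though the tokenizer never produces them.
```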
The paper details two approaches to mitigate this misallocation:
- Canonicality by Conditioning: This approach modifies test-time inference without altering training: the model's output distribution is conditioned on the set of canonical token strings during sampling and estimation (a rejection-sampling sketch follows this list).
- Canonicality by Construction: This approach requires training-time changes: the canonicality constraint is built into the model's parameterization, which guarantees that only canonical token strings are generated.
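A minimal sketch of the conditioning idea is sequence-level rejection sampling: draw complete token sequences from the unmodified model and keep only the canonical ones, which yields exact samples from the model conditioned on canonicality. The helper names here are placeholders, and the paper's actual procedures (including cheaper local approximations applied token by token) are more refined than this:

```python
import random

def sample_canonical(sample_sequence, is_canonical, max_tries=10_000):
    """Exact sampling from the LM conditioned on canonicality, by rejection:
    accepted sequences are distributed as q(seq | seq is canonical)."""
    for _ in range(max_tries):
        seq = sample_sequence()   # one draw from the unconstrained model
        if is_canonical(seq):
            return seq
    raise RuntimeError("acceptance rate too low; too much noncanonical mass")

# Hypothetical toy usage: a 'model' that leaks 10% of its mass onto a
# noncanonical encoding of the string "abab".
toy_model = lambda: random.choices(
    [["abab"], ["ab", "a", "b"]], weights=[0.9, 0.1])[0]
print(sample_canonical(toy_model, lambda seq: seq == ["abab"]))  # ['abab']
```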
Key Contributions
The authors contribute several theoretical and empirical insights:
- Proofs and Guarantees: They prove formally that enforcing canonicality can only reduce the KL divergence between the estimated and true distributions, implying a closer fit to the target distribution (a short derivation is given after this list).
- Empirical Evaluations: Across a range of model sizes and corpora, they establish improvements in data likelihood when employing canonicality-enforced models.
- Efficient Testing: The paper introduces an efficient incremental canonicality test for BPE that is simpler than prior approaches and avoids heavy automata-theoretic machinery (the naive check it improves upon is sketched after the derivation below).
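The intuition behind the KL guarantee can be stated in a few lines. Let $C$ denote the set of canonical token strings; the true distribution $p$ satisfies $p(C) = 1$, and conditioning the model $q$ on $C$ gives $q_C(x) = q(x)/q(C)$ for $x \in C$. Then, by a standard renormalization argument (not a reproduction of the paper's proof):

```latex
\mathrm{KL}(p \,\|\, q_C)
  = \sum_{x \in C} p(x) \log \frac{p(x)\, q(C)}{q(x)}
  = \mathrm{KL}(p \,\|\, q) + \log q(C)
  \;\le\; \mathrm{KL}(p \,\|\, q)
```

since $q(C) \le 1$ implies $\log q(C) \le 0$, with strict improvement whenever $q$ places any mass outside $C$.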
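As for the canonicality test itself, the paper's efficient incremental algorithm is not reproduced here; the sketch below is only the naive reference check it improves upon, reusing the toy `bpe_encode` defined earlier. Re-encoding the decoded string from scratch at every generation step is correct but quadratic overall, which is exactly the cost an incremental test avoids:

```python
def is_canonical(tokens: list[str]) -> bool:
    """Naive canonicality check: a token sequence is canonical iff
    deterministically re-encoding its decoded string reproduces it.
    (Requires the toy `bpe_encode` from the sketch in the Overview.)"""
    return bpe_encode("".join(tokens)) == tokens

print(is_canonical(["abab"]))          # True
print(is_canonical(["ab", "a", "b"]))  # False: same string, wrong tokens
```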
Numerical Results
The study validates the methods empirically, reporting log-loss reductions across multiple datasets and models. Notable log-loss improvements are observed when globally canonicalized models are applied to datasets such as Penn Treebank and WikiText. The evaluation also shows that local approximations closely track the global method because the misallocated probability mass is typically small, which makes them practically viable at a reduced computational cost.
Implications and Future Directions
The authors suggest that the study's findings pave the way for more reliable and robust LLMs by ensuring alignment with the canonical distributions derived from the original training data. Future developments could explore integrating canonicality constraints into other tokenization schemes and examine the implications for varied downstream tasks. Moreover, automated methods for constraining noncanonical outputs in new architectures and AI-based systems offer a promising direction for continued research.
Conclusion
The paper makes an effective case for stricter adherence to canonical outputs in token-level LLMs, arguing that doing so improves both model efficiency and accuracy. It provides strong theoretical and empirical backing for its methods and reflects on the practical and broader implications for AI systems that rely on tokenization. The research thus invites further exploration of its methodology, potentially extending beyond BPE into other areas of computational linguistics and machine learning.