Entropy-Driven Pre-Tokenization for Byte-Pair Encoding
The paper "Entropy-Driven Pre-Tokenization for Byte-Pair Encoding" addresses the challenge of applying Byte-Pair Encoding (BPE) to languages without explicit word boundaries, such as Chinese. BPE, a subword tokenization method, is widely used because it is simple and effective at representing subword units across many languages. Applied to Chinese, however, its purely frequency-driven merge procedure ignores the linguistic boundaries intrinsic to the language. The paper proposes two entropy-informed pre-tokenization strategies that improve BPE's alignment with linguistic structure in unsegmented text.
Methodology
The authors introduce two independent methods that utilize entropy-based cues to guide tokenization boundaries:
- Statistical method: This approach identifies candidate segmentation boundaries using Pointwise Mutual Information (PMI) and left/right (branching) entropy, which capture local co-occurrence strength and contextual diversity, respectively. PMI gauges how strongly two adjacent characters are associated (strong association argues against a boundary between them), while left/right entropy measures how varied the characters preceding or following a position are (high diversity is a classic signal of a word boundary). A sketch of this scoring appears after this list.
- Auto-regressive LLM-based method: This approach uses a pretrained GPT-2 model to derive predictive entropy, letting the model's uncertainty inform boundary placement. Positions where the conditional entropy of the next-token prediction is high are treated as candidate breakpoints; a sketch follows the next paragraph.
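As a rough illustration of the statistical scoring, the sketch below combines PMI and left/right branching entropy into a single score for each gap between adjacent characters. The weighting, threshold, and helper names (`boundary_scores`, `pmi_weight`, `ent_weight`) are illustrative assumptions, not the paper's exact formulation.

```python
import math
from collections import Counter, defaultdict

def boundary_scores(text, pmi_weight=1.0, ent_weight=1.0):
    """Score each gap between adjacent characters; higher = more boundary-like.

    Illustrative only: the paper combines PMI and left/right entropy, but the
    exact weighting and thresholding here are assumptions.
    """
    # Character unigram and bigram counts over the text.
    unigrams = Counter(text)
    bigrams = Counter(zip(text, text[1:]))
    total_uni = sum(unigrams.values())
    total_bi = sum(bigrams.values())

    # Context distributions used for left/right branching entropy.
    right_ctx = defaultdict(Counter)
    left_ctx = defaultdict(Counter)
    for a, b in zip(text, text[1:]):
        right_ctx[a][b] += 1
        left_ctx[b][a] += 1

    def entropy(counter):
        n = sum(counter.values())
        return -sum((c / n) * math.log2(c / n) for c in counter.values())

    scores = []
    for a, b in zip(text, text[1:]):
        # PMI: strong association between a and b argues AGAINST a boundary here.
        p_ab = bigrams[(a, b)] / total_bi
        pmi = math.log2(p_ab / ((unigrams[a] / total_uni) * (unigrams[b] / total_uni)))
        # Branching entropy: many possible continuations argues FOR a boundary.
        branch = entropy(right_ctx[a]) + entropy(left_ctx[b])
        scores.append(ent_weight * branch - pmi_weight * pmi)
    return scores

# Example: insert a pre-token boundary wherever the score exceeds a threshold.
text = "研究生命起源"  # toy unsegmented Chinese string
cuts = [i + 1 for i, s in enumerate(boundary_scores(text)) if s > 2.0]
```

In practice the counts would come from a large corpus rather than the string being segmented, and the threshold would be tuned on held-out data.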
Both methods aim to derive a more granular token structure that aligns better with linguistic units, thereby improving downstream language modeling. The core hypothesis is that entropy, as an unsupervised information-theoretic signal, can effectively indicate linguistically plausible boundaries.
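A comparable sketch of the LLM-based signal: compute the entropy of a causal language model's next-token distribution at each character position and treat high-entropy positions as candidate breakpoints. The checkpoint name (`uer/gpt2-chinese-cluecorpussmall`, chosen because its tokenizer is roughly character-level) and the mean-entropy threshold are assumptions for illustration; the paper only specifies a pretrained GPT-2.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "uer/gpt2-chinese-cluecorpussmall"  # assumed Chinese GPT-2 checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def next_token_entropy(text):
    """Entropy (in bits) of the model's next-token distribution at each position."""
    ids = tokenizer(text, return_tensors="pt", add_special_tokens=False)["input_ids"]
    with torch.no_grad():
        logits = model(ids).logits          # shape: [1, seq_len, vocab_size]
    probs = torch.softmax(logits[0], dim=-1)
    ent = -(probs * torch.log2(probs.clamp_min(1e-12))).sum(dim=-1)
    # ent[i] is the model's uncertainty about the token FOLLOWING position i;
    # high values are treated as candidate pre-token boundaries.
    return ent

ent = next_token_entropy("研究生命起源")
boundaries = (ent > ent.mean()).nonzero().flatten().tolist()  # illustrative threshold
```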
Experimental Evaluation
The research evaluates the proposed methods on a subset of the PKU dataset, a standard benchmark for Chinese word segmentation. Comparing the entropy-driven pre-tokenization strategies against standard BPE on segmentation precision, recall, and F1, the evaluation shows substantial improvements from the entropy-based strategies. The statistical method, with a careful balance of PMI and entropy, yielded the highest F1 score of 58.73, outperforming standard BPE by over 9 percentage points. The GPT-2-based entropy method also produced competitive results, highlighting its potential for capturing meaningful boundaries.
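For concreteness, boundary-level precision, recall, and F1 can be computed as in the hedged sketch below. Whether the paper scores boundaries or whole words is not specified here, so this only illustrates the kind of metric involved.

```python
def boundary_set(words):
    """Return the set of cut positions implied by a list of words."""
    cuts, pos = set(), 0
    for w in words[:-1]:  # no cut after the final word
        pos += len(w)
        cuts.add(pos)
    return cuts

def prf(pred_words, gold_words):
    pred, gold = boundary_set(pred_words), boundary_set(gold_words)
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Example: gold segmentation "研究 / 生命 / 起源" vs. a predicted one.
print(prf(["研究", "生命起", "源"], ["研究", "生命", "起源"]))  # (0.5, 0.5, 0.5)
```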
Implications and Future Directions
The implications of this research are both practical and theoretical. Practically, the methods offer an improved approach to subword tokenization for languages without explicit segment delimiters, promising gains in applications such as machine translation and named entity recognition. Theoretically, the work opens avenues for integrating unsupervised entropy signals into other aspects of LLM training and tokenization, and it argues for moving from purely frequency-based tokenization schemes to ones informed by statistical and learned uncertainty.
Future research could explore the integration of these entropy-driven methods in larger LLMs and examine their impact on downstream tasks across diverse unsegmented languages. Further, the adaptability of these methods to byte-level tokenization presents an opportunity to unify tokenization strategies across multilingual datasets, making LLMs more robust and contextually aware in processing diverse linguistic structures. Overall, the paper provides a compelling case for leveraging entropy-based information to improve the granularity and linguistic alignment of tokenized sequences in computational linguistics.