A partition cover approach to tokenization
Abstract: Tokenization is the process of encoding strings into tokens of a fixed vocabulary size, and is widely utilized in Natural Language Processing applications. The leading tokenization algorithm today is Byte-Pair Encoding (BPE), which formulates the tokenization problem as a compression problem and tackles it by performing sequences of merges. In this work, we formulate tokenization as an optimization objective, show that it is NP-hard via a simple reduction from vertex cover, and propose a polynomial-time greedy algorithm GreedTok. Our formulation naturally relaxes to the well-studied weighted maximum coverage problem which has a simple $(1 - 1/e)$-approximation algorithm GreedWMC. Through empirical evaluations on real-world corpora, we show that GreedTok outperforms BPE and Unigram on compression and achieves a covering score comparable to GreedWMC. Finally, our extensive pre-training for two transformer-based LLMs with 1 billion parameters, comparing the choices of BPE and GreedTok as the tokenizer, shows that GreedTok achieves a lower bit per byte even when we control for either the total dataset proportion or total training tokens.
Paper Prompts
Sign up for free to create and run prompts on this paper using GPT-5.
Top Community Prompts
Collections
Sign up for free to add this paper to one or more collections.