Tokenisation is NP-Complete (2412.15210v1)
Abstract: In this work, we prove the NP-completeness of two variants of tokenisation, defined as the problem of compressing a dataset to at most $\delta$ symbols by either finding a vocabulary directly (direct tokenisation), or selecting a sequence of merge operations (bottom-up tokenisation).
Summary
- The paper proves NP-completeness by reducing the well-known max-2-SAT problem to both the direct and bottom-up tokenisation formulations.
- It formalizes optimal subword vocabulary compression and merge operations, highlighting computational challenges in NLP tokeniser design.
- The findings emphasize the need for heuristic approaches, since efficient, optimal tokenisation is computationally intractable unless P = NP.
Overview of "Tokenisation is NP-Complete"
The paper, "Tokenisation is NP-Complete," authored by Philip Whittington, Gregor Bachmann, and Tiago Pimentel, addresses the computational complexity of the tokenisation problem in NLP. Focusing on two specific tokenisation forms—direct tokenisation and bottom-up tokenisation—the authors demonstrate the NP-completeness of these variants, thus highlighting the inherent computational challenges in optimizing tokeniser design for LLMs (LMs).
Main Contributions
- Theoretical Framework: The paper formalizes two variants of the tokenisation problem. In direct tokenisation, the objective is to select a bounded-size subword vocabulary that compresses a dataset of character strings to at most $\delta$ symbols. Bottom-up tokenisation, exemplified by Byte Pair Encoding (BPE), instead seeks a sequence of merge operations achieving the same compression target (a minimal scoring sketch for both variants follows this list).
- NP-Completeness Proof: The authors prove that both direct and bottom-up tokenisation are NP-complete. They give a detailed reduction from the maximum 2-satisfiability (max-2-SAT) problem, a known NP-hard problem, to each tokenisation variant. These reductions establish that finding optimal tokenisers is at least as hard as solving max-2-SAT, and hence intractable unless P = NP.
- Example Applications: Through illustrative constructions, the paper shows how tokenisation choices affect downstream LLM applications: poorly chosen tokenisers can degrade performance on tasks such as arithmetic operations and lexical counting, which are sensitive to how inputs are segmented.
- Alternative Tokenisation Strategies: The paper discusses other potential approaches to tokenisation beyond the direct and bottom-up methods. These include deduplicated datasets, single long strings, and hybrid approaches combining elements from both studied tokenisation variants. Each approach carries implications for computational efficiency and LLM performance.
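To make the two objective functions concrete, here is a minimal Python sketch, not taken from the paper, of how a candidate solution for each variant might be scored: a vocabulary is scored by the number of tokens it yields (direct), and a merge sequence by the number of symbols left after applying the merges in order (bottom-up). The greedy longest-match segmenter and the function names are illustrative assumptions; the paper's direct objective is defined over optimal segmentations.

```python
from typing import Iterable

def direct_token_count(dataset: Iterable[str], vocab: set[str]) -> int:
    """Score a vocabulary for direct tokenisation: segment each string with
    greedy longest-match over `vocab` (single characters are always available)
    and count the tokens used. Greedy matching is only a stand-in for the
    optimal segmentation that the paper's objective is defined over."""
    total = 0
    for text in dataset:
        i = 0
        while i < len(text):
            step = 1
            for j in range(len(text), i + 1, -1):  # try the longest substring first
                if text[i:j] in vocab:
                    step = j - i
                    break
            total += 1
            i += step
    return total

def bottom_up_token_count(dataset: Iterable[str], merges: list[tuple[str, str]]) -> int:
    """Score a merge sequence for bottom-up tokenisation: apply each merge
    (left, right) -> left+right in order, as BPE inference does, then count
    the symbols that remain."""
    total = 0
    for text in dataset:
        symbols = list(text)
        for left, right in merges:
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                    out.append(left + right)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            symbols = out
        total += len(symbols)
    return total
```

Deciding whether any bounded-size vocabulary or merge sequence drives these counts down to at most $\delta$ is exactly the decision problem the paper proves NP-complete.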
Implications and Future Directions
The findings underscore that efficient algorithms for finding optimal tokenisers are unlikely to exist, given the NP-completeness of the problem. Consequently, practice must rely on heuristic and approximate methods, such as BPE and UnigramLM, which are widely used yet lack optimality guarantees.
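For context, the sketch below shows the core of the greedy BPE training heuristic (a minimal, assumed implementation, not the paper's construction): at each step it merges the currently most frequent adjacent pair, with no guarantee that the resulting merge sequence is optimal for the compression objective.

```python
from collections import Counter

def train_bpe(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Greedy BPE training heuristic: repeatedly merge the most frequent
    adjacent pair of symbols. Widely used, but it carries no optimality
    guarantee for the compression objective studied in the paper."""
    sequences = [list(text) for text in corpus]
    merges: list[tuple[str, str]] = []
    for _ in range(num_merges):
        pair_counts: Counter = Counter()
        for seq in sequences:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break
        (left, right), _ = pair_counts.most_common(1)[0]
        merges.append((left, right))
        # Apply the chosen merge everywhere before recounting pairs.
        new_sequences = []
        for seq in sequences:
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and seq[i] == left and seq[i + 1] == right:
                    out.append(left + right)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            new_sequences.append(out)
        sequences = new_sequences
    return merges
```

UnigramLM is likewise heuristic, pruning a large initial vocabulary rather than growing one through merges.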
- Practical Implications:
Theoretical insights from this paper inform the design of tokenisation algorithms that trade off computational feasibility against approximation quality. Practitioners in NLP are encouraged to develop new tokenisation strategies that achieve favorable approximations without incurring prohibitive computational costs.
- Theoretical Implications:
The results open avenues for further research on other variants of the tokenisation problem, for instance those defined by different objective functions or constraints. Understanding these variants could significantly deepen the theoretical picture of LLM preprocessing stages.
- Exploration of Compression Techniques:
The relationship between tokenisation and dictionary compression is ripe for further exploration. Improvements in dictionary compression methods could translate into more effective tokenisation techniques by leveraging advances in compressive data representations.
Conclusion
Overall, this paper makes substantial contributions to the field of computational linguistics, providing a rigorous examination of tokenisation from a computational complexity perspective. By establishing tokenisation as an NP-complete problem, the authors shift the focus towards approximate and heuristic methods, while also inviting future research into alternate approaches and objective functions for tokenisation. The interplay between compression, computational complexity, and tokenisation elucidated in this work offers a foundational understanding for both theoretical advancement and practical application in NLP systems.
Related Papers
- Improving Tokenisation by Alternative Treatment of Spaces (2022)
- Tokenization Is More Than Compression (2024)
- Unpacking Tokenization: Evaluating Text Compression and its Correlation with Model Performance (2024)
- Infusing clinical knowledge into tokenisers for language models (2024)
- Exact Byte-Level Probabilities from Tokenized Language Models for FIM-Tasks and Model Ensembles (2024)