
Tokenization as Finite-State Transduction (2410.15696v1)

Published 21 Oct 2024 in cs.CL and cs.FL

Abstract: Tokenization is the first step in modern neural LLM pipelines, where an input text is converted into a sequence of subword tokens. We introduce from first principles a finite-state transduction framework which can efficiently encode all possible tokenizations of a regular language. We then constructively show that Byte-Pair Encoding (BPE) and MaxMatch (WordPiece), two popular tokenization schemes, fit within this framework. For BPE, this is particularly surprising given its resemblance to a context-free grammar and the fact that it does not tokenize strings from left to right. An application of this is to guided generation, where the outputs of an LLM are constrained to match some pattern. Here, patterns are encoded at the character level, which creates a mismatch between the constraints and the model's subword vocabulary. While past work has focused only on constraining outputs without regard to the underlying tokenization algorithm, our framework allows for simultaneously constraining the model's outputs to match a specified pattern while also adhering to the underlying tokenizer's canonical tokenization.
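
To make the abstract's setup concrete, the sketch below (a toy illustration, not the paper's actual finite-state construction) shows why a single string admits many tokenizations over a subword vocabulary while the tokenization algorithm selects exactly one canonical one. The vocabulary, function names, and the greedy MaxMatch variant here are illustrative assumptions.

```python
# Toy subword vocabulary (an assumption for illustration only).
VOCAB = {"a", "b", "c", "ab", "bc", "abc"}

def all_tokenizations(text):
    """Enumerate every segmentation of `text` into vocabulary tokens."""
    if not text:
        yield []
        return
    for i in range(1, len(text) + 1):
        prefix = text[:i]
        if prefix in VOCAB:
            for rest in all_tokenizations(text[i:]):
                yield [prefix] + rest

def maxmatch(text):
    """Greedy longest-match-first tokenization (MaxMatch / WordPiece-style)."""
    tokens = []
    while text:
        for i in range(len(text), 0, -1):  # try the longest prefix first
            if text[:i] in VOCAB:
                tokens.append(text[:i])
                text = text[i:]
                break
        else:
            raise ValueError("untokenizable input")
    return tokens

print(list(all_tokenizations("abc")))
# [['a', 'b', 'c'], ['a', 'bc'], ['ab', 'c'], ['abc']]  <- four valid segmentations
print(maxmatch("abc"))
# ['abc']  <- the single canonical tokenization under this scheme
```

Roughly, the paper's framework encodes the set that all_tokenizations enumerates above as a finite-state transducer, so that a character-level pattern constraint and the tokenizer's canonical tokenization can both be imposed by finite-state composition rather than by explicit enumeration.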

