Formalizing BPE Tokenization (2309.08715v1)

Published 15 Sep 2023 in cs.FL

Abstract: In this paper, we formalize practical byte pair encoding (BPE) tokenization as it is used in LLMs and other NLP systems. In particular, we formally define and investigate the semantics of the SentencePiece and HuggingFace tokenizers, and how they relate to each other depending on how the tokenization rules are constructed. Beyond this, we consider how tokenization can be performed incrementally, as well as left-to-right using an amount of memory that is constant in the length of the string, which enables, for example, the use of a finite-state string-to-string transducer.
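
For readers unfamiliar with BPE, the following is a minimal sketch of a generic priority-ordered merge loop. It is an illustration only and is not taken from the paper: the function name `bpe_tokenize`, the rule list `merges`, and the tie-breaking choice (leftmost occurrence of the highest-priority pair) are assumptions made here, and the sketch is not claimed to reproduce the exact semantics of either SentencePiece or HuggingFace.

```python
def bpe_tokenize(word, merges):
    """Tokenize `word` with an ordered list of merge rules (hypothetical helper).

    `merges` is a list of (left, right) token pairs; earlier entries have
    higher priority. Assumption: at each step, the highest-priority pair
    present in the sequence is merged at its leftmost occurrence.
    """
    # Map each rule to its priority (lower index = higher priority).
    rank = {pair: i for i, pair in enumerate(merges)}
    tokens = list(word)  # start from single characters

    while len(tokens) > 1:
        best = None  # (priority, position) of the best applicable rule
        for i in range(len(tokens) - 1):
            r = rank.get((tokens[i], tokens[i + 1]))
            # Strict "<" keeps the leftmost position among equal priorities.
            if r is not None and (best is None or r < best[0]):
                best = (r, i)
        if best is None:
            break  # no rule applies; tokenization is complete
        _, i = best
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]

    return tokens


if __name__ == "__main__":
    rules = [("a", "b"), ("ab", "c"), ("b", "c")]
    print(bpe_tokenize("abc", rules))
    print(bpe_tokenize("bca", rules))
```

This loop rescans the whole sequence after every merge, so it is quadratic in the worst case and needs the full string in memory. The abstract's interest in incremental and constant-memory left-to-right tokenization concerns avoiding exactly that kind of global rescanning.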

Citations (5)
