
The Foundations of Tokenization: Statistical and Computational Concerns (2407.11606v3)

Published 16 Jul 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Tokenization - the practice of converting strings of characters from an alphabet into sequences of tokens over a vocabulary - is a critical step in the NLP pipeline. The use of token representations is widely credited with increased model performance but is also the source of many undesirable behaviors, such as spurious ambiguity or inconsistency. Despite its recognized importance as a standard representation method in NLP, the theoretical underpinnings of tokenization are not yet fully understood. In particular, the impact of tokenization on statistical estimation has been investigated mostly through empirical means. The present paper contributes to addressing this theoretical gap by proposing a unified formal framework for representing and analyzing tokenizer models. Based on the category of stochastic maps, this framework enables us to establish general conditions for a principled use of tokenizers, and most importantly, the necessary and sufficient conditions for a tokenizer model to preserve the consistency of statistical estimators. Additionally, we discuss statistical and computational concerns crucial for designing and implementing tokenizer models, such as inconsistency, ambiguity, tractability, and boundedness. The framework and results advanced in this paper contribute to building robust theoretical foundations for representations in neural language modeling that can inform future empirical research.
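
To make the notion of spurious ambiguity mentioned in the abstract concrete, the following is a minimal sketch, not the paper's formal framework. It assumes a hypothetical three-token vocabulary (`VOCAB`) and hypothetical helper names (`decode`, `all_tokenizations`, `greedy_encode`), none of which come from the paper. It shows that several distinct token sequences can decode to the same string, while a deterministic encoder picks only one of them.

```python
from typing import List, Tuple

# Hypothetical toy vocabulary (an assumption for illustration, not from the paper).
VOCAB: Tuple[str, ...] = ("a", "b", "ab")


def decode(tokens: Tuple[str, ...]) -> str:
    """Map a token sequence back to a character string by concatenation."""
    return "".join(tokens)


def all_tokenizations(text: str, vocab: Tuple[str, ...] = VOCAB) -> List[Tuple[str, ...]]:
    """Enumerate every token sequence over `vocab` whose concatenation is `text`."""
    if text == "":
        return [()]
    results: List[Tuple[str, ...]] = []
    for tok in vocab:
        if text.startswith(tok):
            for rest in all_tokenizations(text[len(tok):], vocab):
                results.append((tok,) + rest)
    return results


def greedy_encode(text: str, vocab: Tuple[str, ...] = VOCAB) -> Tuple[str, ...]:
    """A deterministic encoder that always takes the longest matching token."""
    tokens: List[str] = []
    i = 0
    while i < len(text):
        match = max((t for t in vocab if text.startswith(t, i)), key=len, default=None)
        if match is None:
            raise ValueError(f"cannot tokenize {text!r} at position {i}")
        tokens.append(match)
        i += len(match)
    return tuple(tokens)


if __name__ == "__main__":
    for seq in all_tokenizations("abab"):
        print(seq, "->", decode(seq))
    print("greedy:", greedy_encode("abab"))
```

Under these assumptions, a model defined over token sequences can place probability on every sequence returned by `all_tokenizations`, even though the encoder only ever produces the one returned by `greedy_encode`. This is one informal way to picture the gap between the string level and the token level that the paper analyzes.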

Authors (6)
  1. Juan Luis Gastaldi (6 papers)
  2. John Terilla (12 papers)
  3. Luca Malagutti (5 papers)
  4. Brian DuSell (14 papers)
  5. Tim Vieira (29 papers)
  6. Ryan Cotterell (226 papers)
Citations (2)
