
GreedTok: Greedy NLP Tokenization

Updated 6 January 2026
  • GreedTok is a family of greedy algorithms for tokenization and RL fine-tuning, using longest-prefix methods and token hidden rewards to boost model performance.
  • It employs stateless inference with O(n·D) complexity and a partition cover heuristic to achieve improved subword extraction and 3–5% better compression than BPE.
  • In reinforcement learning, GreedTok reweights token-level advantages via THR, yielding up to 4.1-point gains in pass@1 on math reasoning tasks.

GreedTok refers to a family of greedy algorithms for tokenization and token-level exploitation in NLP tasks. The term describes two distinct but related lines of research: (a) greedy inference strategies for word segmentation given a fixed subword vocabulary (notably, the "longest-prefix" greedy method known as GreedTok), and (b) a greedy token-level exploitation mechanism in reinforcement learning fine-tuning for LLMs based on the Token Hidden Reward (THR) metric. Both approaches are united by their use of practical, stateless, and efficient greedy schemes that either construct or utilize tokenizations to improve downstream model performance.

1. GreedTok in Tokenizer Inference

GreedTok was introduced as a greedy family of inference methods for segmenting text using a fixed vocabulary, specifically "longest-prefix," "longest-suffix," and "longest-token" strategies (Uzan et al., 2024). The most prevalent variant is the "longest-prefix" greedy method. Formally, given a vocabulary V and a word w (as a character string), the algorithm iteratively consumes the longest prefix of the remaining string that matches a token in V, appends it to the output segmentation, and repeats until the full word is parsed. This process is stateless, requires no backtracking, and can be efficiently implemented using a prefix trie, resulting in O(n · D) complexity for word length n and trie depth D.
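The longest-prefix procedure can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation; the helper names (`build_trie`, `longest_prefix_tokenize`) and the single-character fallback for out-of-vocabulary symbols are assumptions for the sake of a self-contained example:

```python
# Minimal sketch of longest-prefix greedy inference over a prefix trie.
# Walk the trie from the current position, remember the deepest node that
# ends a valid token, emit it, and restart from the next character.

def build_trie(vocab):
    trie = {}
    for token in vocab:
        node = trie
        for ch in token:
            node = node.setdefault(ch, {})
        node["$"] = token  # mark the end of a valid token
    return trie

def longest_prefix_tokenize(word, trie):
    out, i = [], 0
    while i < len(word):
        node, match = trie, None
        for j in range(i, len(word)):
            if word[j] not in node:
                break
            node = node[word[j]]
            if "$" in node:
                match = node["$"]  # deepest (longest) token seen so far
        if match is None:
            match = word[i]  # assumed fallback: emit a single character
        out.append(match)
        i += len(match)
    return out

trie = build_trie({"un", "unhappi", "ness"})
print(longest_prefix_tokenize("unhappiness", trie))  # ['unhappi', 'ness']
```

Because the inner walk is bounded by the trie depth D, the whole pass over a word of length n is O(n · D), matching the complexity stated above.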

The core rationale is that by always selecting the maximal prefix, GreedTok maximizes the likelihood of recovering large, meaningful subword units, frequently corresponding to morphemes, and yielding a balanced distribution of token frequencies. This greedy approach stands in contrast to dynamic programming or sequence likelihood decoding used in some tokenization algorithms.

2. Optimization-Based Tokenization: Partition Cover and GreedTok

GreedTok also designates a greedy algorithm for solving an explicit combinatorial optimization formulation of tokenization, specifically the partition cover problem (Lim et al., 8 Jan 2025). In this view, tokenization is cast as selecting a set S of multi-symbol tokens (subject to a vocabulary budget |S| ≤ k) to minimize the expected number of tokens (pieces) needed to represent the corpus, or equivalently, to maximize the number of internal symbol boundaries that can be merged (covered) by tokens in S.

The exact problem is NP-hard (shown via reduction from Vertex Cover). Nevertheless, GreedTok provides an efficient, practical heuristic as follows:

  • For each candidate token t ∈ T, precalculate all occurrences in the corpus.
  • Maintain a map M(W) for each word W, recording which symbol boundaries are already covered.
  • Iteratively, for k rounds, select the unused token t with maximal marginal gain in coverage, i.e., the number of new boundaries it can cover without violating overlap/consistency constraints, and update M(·) accordingly.

Tokenization of a new word under the selected S proceeds by matching and applying tokens in the priority order in which they were greedily selected. Selecting S requires O(|T| · k · Σ_W |W|) time, and tokenizing an individual word is O(|W|² log |W|).
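The selection loop above can be illustrated with a toy version. This sketch follows the stated scheme but deliberately simplifies it: it counts a boundary as covered by any occurrence of a token and ignores the paper's overlap/consistency constraints, so it is an assumption-laden caricature of the real heuristic, not the published algorithm:

```python
# Toy greedy partition-cover selection. A "boundary" b in a word is the gap
# between characters b and b+1; a token occurrence spanning [i, i+len) covers
# the interior boundaries i .. i+len-2.

def covered_by(word, token):
    """Set of interior boundaries covered by any occurrence of token."""
    spans = set()
    for i in range(len(word) - len(token) + 1):
        if word[i:i + len(token)] == token:
            spans.update(range(i, i + len(token) - 1))
    return spans

def greedy_select(word_counts, candidates, k):
    """Pick up to k tokens, each maximizing marginal boundary coverage."""
    covered = {w: set() for w in word_counts}
    chosen = []
    for _ in range(k):
        def gain(t):
            return sum(c * len(covered_by(w, t) - covered[w])
                       for w, c in word_counts.items())
        remaining = [t for t in candidates if t not in chosen]
        best = max(remaining, key=gain, default=None)
        if best is None or gain(best) == 0:
            break  # no token adds coverage
        chosen.append(best)
        for w in word_counts:
            covered[w] |= covered_by(w, best)
    return chosen

# Frequent words weighted by corpus count; "low" covers more weighted
# boundaries than "est", so it is selected first.
print(greedy_select({"lowest": 5, "lower": 3}, ["low", "est", "er", "we"], 2))
```

The real algorithm additionally enforces that selected occurrences are mutually consistent within a word, which is exactly what makes the objective non-submodular and the approximation analysis open.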

The approach is closely related to the weighted maximum coverage (WMC) problem, for which the classic greedy algorithm achieves a (1 − 1/e) approximation. Although the precise objective for tokenization is neither submodular nor supermodular, GreedTok empirically reaches within 0.9·(1 − 1/e) of the WMC optimum as k increases.

3. Evaluation Metrics and Empirical Results

GreedTok’s performance in greedy inference was benchmarked across morphological, cognitive, and information-theoretic criteria (Uzan et al., 2024):

  • Morphological Alignment (macro-F₁): GreedTok (longest-prefix) achieves significant gains over BPE default decoders (F₁ = 0.8584 vs 0.6309 for BPE; 0.9222 vs 0.9149 for UnigramLM).
  • Cognitive Plausibility (mean |r|): Slightly lower but competitive scores relative to the default algorithms.
  • Rényi Entropy Efficiency (α = 2.5): GreedTok often matches or exceeds default methods, indicating a flatter, more uniform token distribution.
  • Tokens per Word: GreedTok yields marginally fewer tokens per word than the default BPE and WordPiece decoders, a slight compression gain.

For partition-cover tokenization (Lim et al., 8 Jan 2025), the following results were observed on large corpora (e.g., UN, arXiv abstracts, Wikipedia, PubMed):

Method      Target Mean T/W   Required k   Compression vs BPE
GreedTok    1.5               1,340        3–5% better
BPE         1.5               1,629        —
GreedWMC    ~1.5              —            Slightly better

Thus, GreedTok requires on average 13% fewer tokens in the vocabulary to achieve a target mean tokens-per-word and yields 3–5% better compression vs BPE at fixed vocabulary size.

4. GreedTok in RL Fine-Tuning: Token Hidden Reward and Greedy Exploitation

An independent but related definition of GreedTok arises in LLM reinforcement learning fine-tuning, where it refers to a greedy, token-level exploitation dial based on the Token Hidden Reward (THR) metric (Deng et al., 4 Oct 2025). THR quantifies the causal impact of each token on the model's likelihood of correct completions under Group Relative Policy Optimization (GRPO).

Formally, for completions {y_j} with rewards r_j, the THR of each token y_{j,k} is computed by aggregating its influence on correct completions relative to the group's responses. Tokens with positive THR directly reinforce correct outputs (exploitation), while those with negative THR maintain probability mass for alternative (possibly correct) completions (exploration).

The GreedTok RL algorithm uses THR to reweight per-token GRPO advantages:

  • Mask dominant tokens by THR magnitude: m_{j,k} = 1{|THR_{j,k}| > τ}
  • Set GreedTok weights: w_{j,k}(p) = 1 + sign(THR_{j,k}) · p
  • Compute the reweighted advantage: Â_{j,k}^GreedTok = m_{j,k} · w_{j,k}(p) · Â_{j,k}
  • Insert into the GRPO/GSPO loss for policy updates.

Increasing p intensifies exploitation by upweighting the tokens most predictive of correct completions. Empirical results show Pass@1 improvements of 1.5–4.1 points on math reasoning tasks (Qwen2.5-Math and Llama3.2), with the method generalizing across architectures and objectives.

5. Practical Recommendations and Efficiency

GreedTok algorithms are computationally efficient and straightforward to implement:

  • Tokenizer inference (longest-prefix strategy): O(n · D) per word, highly parallelizable, and ideal for production-scale systems as a drop-in replacement for BPE/WordPiece inference in libraries such as Huggingface or SentencePiece.
  • Partition-cover construction: Practical runtime for corpora of tens of millions of words, with polynomial complexity in vocabulary size and corpus length (Lim et al., 8 Jan 2025).
  • RL fine-tuning (THR-guided advantage weighting): THR computation adds <10% overhead to RL fine-tuning and scales with the number of token pairs.

Recommendations include using GreedTok as a replacement inference strategy to boost morphological alignment and compression without retraining, particularly for morphologically rich languages and linguistics tasks where subword boundary fidelity is paramount (Uzan et al., 2024). For RL-based LLM fine-tuning, GreedTok provides a principled and dynamic dial to steer the exploration–exploitation tradeoff at the token level (Deng et al., 4 Oct 2025).

6. Limitations and Theoretical Guarantees

While GreedTok approaches either match or surpass existing heuristics on pragmatic metrics, certain theoretical aspects remain unresolved:

  • For partition cover tokenization, although GreedTok closely matches the (1 − 1/e) greedy guarantee for weighted maximum coverage, a formal approximation ratio has not been established due to the additional overlap and consistency constraints present in the tokenization domain (Lim et al., 8 Jan 2025).
  • GreedTok inference is not guaranteed to optimize cognitive plausibility measures relevant to human segmentation preferences, although in most benchmarks it ties or only marginally trails state-of-the-art methods (Uzan et al., 2024).
  • Downstream LLM impacts (bit-per-byte measures, full-scale pretraining) for the partition-cover GreedTok construction remain under-evaluated; future work aims to systematically address these aspects.

7. Relationship to Other Methods and Future Directions

GreedTok is empirically surpassed only by context-informed or likelihood-driven inference (e.g., SaGe or UnigramLM with context) in specific settings. However, the simplicity, speed, and strong morphological alignment of GreedTok make it attractive for a broad array of applications. Innovative uses in RL-tuned LLMs suggest further investigation into token-level credit assignment metrics and greedy exploitation mechanisms.

A plausible implication is that continued research into hybrid greedy-dynamic schemes and token-level reward shaping will yield further advances in both unsupervised tokenization and RL-based fine-tuning for LLMs. There is ongoing interest in more formally characterizing the approximation properties of greedy tokenization with domain-specific constraints and in extending THR-guided fine-tuning to multiturn and dialog settings.
