Leaf-Based Vocabulary Pruning
- The approach is a structural, frequency-guided method that safely removes leaf tokens (or leaf classes) while preserving the vocabulary's combinatorial integrity.
- It employs a heap-based algorithm for BPE tokenizers and a breadth-first traversal for diffusion classifiers, pruning without sacrificing tokenization determinism or classification accuracy.
- Empirical results demonstrate up to a 62.5% vocabulary reduction for tokenizers and nearly a 60% inference speedup for diffusion classifiers, with little or no loss in task performance.
Leaf-based vocabulary pruning is a structural, frequency-guided approach for compacting and optimizing the vocabularies used in machine learning models, specifically tokenizers for LLMs and label sets in diffusion-based classifiers. The central principle is the safe removal of "leaf" nodes—elements that terminate derivation chains or merge graphs—such that the critical combinatorial structure and essential coverage of the vocabulary are preserved. This method has been independently proposed for Byte-Pair-Encoding (BPE) tokenizer adaptation and for reducing class evaluation overhead in diffusion models, yielding empirical gains in inference efficiency without measurable loss in model quality or task performance (Purason et al., 3 Dec 2025, Shanbhag et al., 18 Nov 2024).
1. Formal Definition of Leaves in Structured Vocabularies
In BPE-based tokenizers, the vocabulary and its evolution through merges can be formalized as a directed acyclic graph (DAG) induced by the ordered list of merge rules $\mathcal{M}$. Each node in the graph is a token, with edges defined as follows: when two tokens $u$ and $v$ are merged to form a new token $w = uv$, $w$ is downstream of $u$ and $v$. In this context, a token is a "leaf" if it never serves as a parent in any further merge, i.e., it has zero out-edges in the DAG and does not appear on the left-hand side of any subsequent merge rule in $\mathcal{M}$ (Purason et al., 3 Dec 2025).
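To make the definition concrete, a minimal Python sketch is given below: it builds the out-degree counts of the merge DAG from an ordered list of (left, right) merge pairs (as in a typical BPE merges file) and extracts the leaf tokens. The function name and data layout are illustrative rather than taken from the cited work.

```python
from collections import defaultdict

def find_leaf_tokens(merges):
    """Identify leaf tokens in the DAG induced by ordered BPE merge rules.

    `merges` is a list of (left, right) token pairs; each rule produces the
    concatenated token left + right.  A token is a leaf if it never appears
    as a component of a later merge, i.e. it has zero out-edges in the DAG.
    """
    out_degree = defaultdict(int)      # token -> number of merges it feeds
    vocab = set()
    for left, right in merges:
        out_degree[left] += 1
        out_degree[right] += 1
        vocab.update((left, right, left + right))
    return {tok for tok in vocab if out_degree[tok] == 0}

# Toy example: "the" and "thin" terminate their derivation chains.
merges = [("t", "h"), ("th", "e"), ("i", "n"), ("th", "in")]
print(find_leaf_tokens(merges))        # {'the', 'thin'}
```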
Analogously, in hierarchical classification (as with diffusion classifiers), leaf nodes are class labels with no children in the taxonomic tree. In practice, these are terminal categories, such as image classes in a pruned WordNet hierarchy used for ImageNet-1K, where higher-level synsets are internal nodes and the most specific categories are leaves (Shanbhag et al., 18 Nov 2024).
2. Pruning Criteria and Objectives
Leaf-based vocabulary pruning introduces a frequency-driven, structurally-constrained removal criterion:
- Frequency pruning for tokenizers: Each leaf token $t$ is assigned a score $f(t)$, its empirical frequency in a representative corpus. The goal is to prune the set of leaf tokens with minimal aggregate frequency, ensuring that only leaves are eligible for removal at each stage. The process iteratively removes the lowest-frequency leaves until a target vocabulary size is achieved, optimizing for minimal loss of representational coverage on in-domain texts (Purason et al., 3 Dec 2025); a small frequency-estimation sketch is given below.
- Hierarchical class pruning in diffusion models: At each internal node in the label tree, candidate children are scored by their diffusion-model denoising error (empirical mean squared error over a small number of Monte-Carlo samples). Only a fixed top fraction (ratio $K$) of children, or those within a dynamic $\sigma$-based threshold of the best error, are retained; the rest are pruned. Final classification proceeds only among surviving leaves (Shanbhag et al., 18 Nov 2024).
This selective, order-preserving strategy stands in contrast to naïve frequency-based or arbitrary last-$N$ pruning, which may compromise reachability or model determinism.
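For illustration only, the leaf frequencies used as pruning scores can be estimated by a single counting pass over an in-domain corpus. The sketch below assumes a hypothetical `encode` callable that maps a line of text to its BPE token strings (e.g., a thin wrapper around the deployed tokenizer).

```python
from collections import Counter

def leaf_frequencies(corpus_lines, encode, leaves):
    """Estimate the empirical frequency f(t) of each leaf token.

    corpus_lines : iterable of in-domain text lines
    encode       : callable mapping a line to its BPE token strings (assumed)
    leaves       : set of current leaf tokens
    """
    counts = Counter()
    for line in corpus_lines:
        counts.update(tok for tok in encode(line) if tok in leaves)
    # Leaves never observed in the corpus get frequency 0 and are pruned first.
    return {tok: counts[tok] for tok in leaves}
```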
3. Algorithms and Complexity
3.1. Leaf-Based Pruning for BPE Tokenizers
The procedure employs a heap-based priority queue over current leaves, using their corpus frequency as the ordering key. At each iteration, the leaf with the lowest frequency is pruned, and its immediate sub-tokens are examined to determine if they have become leaves (i.e., no longer serve as components in any other merge). This process continues until the desired vocabulary size is reached. Atomic tokens (those never produced by a merge) are handled to preserve base coverage (Purason et al., 3 Dec 2025).
Pseudocode summary:
- Construct a min-heap of leaves by frequency.
- Repeatedly remove the minimum, updating parenthood counters.
- Newly exposed leaves are added to the heap.
- Each token is pushed and popped at most once, giving $O(|V|\log|V|)$ heap work plus $O(|\mathcal{M}|)$ setup, where $|V|$ is the vocabulary size and $|\mathcal{M}|$ the number of merges (a runnable sketch follows the list).
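A self-contained sketch of this loop, assuming the merge list and a token-frequency dictionary are already available, might look as follows; it is an illustrative reconstruction of the procedure summarized above, not the authors' implementation.

```python
import heapq
from collections import defaultdict

def prune_leaves(merges, freq, target_size):
    """Remove lowest-frequency leaf tokens until the vocabulary reaches target_size.

    merges      : ordered list of (left, right) pairs; each rule yields left + right
    freq        : dict token -> corpus frequency
    target_size : desired vocabulary size after pruning
    """
    out_degree = defaultdict(int)          # how many merges each token feeds
    components = {}                        # merged token -> its (left, right) sub-tokens
    vocab = set()
    for left, right in merges:
        child = left + right
        out_degree[left] += 1
        out_degree[right] += 1
        components[child] = (left, right)
        vocab.update((left, right, child))

    atomic = vocab - set(components)       # never produced by a merge; kept for base coverage
    heap = [(freq.get(t, 0), t) for t in vocab
            if out_degree[t] == 0 and t not in atomic]
    heapq.heapify(heap)                    # min-heap keyed on corpus frequency
    removed = set()

    while heap and len(vocab) - len(removed) > target_size:
        _, tok = heapq.heappop(heap)       # cheapest current leaf
        removed.add(tok)
        for sub in components[tok]:        # its immediate sub-tokens may now become leaves
            out_degree[sub] -= 1
            if out_degree[sub] == 0 and sub not in atomic:
                heapq.heappush(heap, (freq.get(sub, 0), sub))

    return vocab - removed
```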
3.2. Hierarchical Diffusion Class Pruning
The pruning stage is implemented as a breadth-first traversal down the label hierarchy.
Pseudocode summary:
- For each test input, begin with the root’s immediate children.
- At each hierarchy level: for each currently selected parent, compute the mean diffusion denoising error of each of its children over a small number of Monte-Carlo samples.
- Retain only the top-$K$ fraction of children (or those within a given number of standard deviations of the best error); prune the rest.
- Proceed down the levels, ultimately yielding a substantially reduced candidate set of leaves.
- Perform the full-budget diffusion evaluation only on these leaves (a minimal sketch of the traversal follows the list).
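The traversal can be sketched as below, assuming the hierarchy is stored as a parent-to-children dictionary rooted at a node named "root", and assuming a hypothetical `denoise_error(x, label, n_samples)` callable that returns the mean Monte-Carlo denoising error of a candidate label. The sketch shows the fixed top-$K$ variant and is illustrative only.

```python
def hierarchical_prune(x, tree, denoise_error, keep_ratio=0.5, n_probe=8):
    """Breadth-first pruning of a label hierarchy for a diffusion classifier.

    tree          : dict mapping each internal node to its list of children;
                    terminal categories (leaves) have no entry in `tree`
    denoise_error : callable (input, label, n_samples) -> mean denoising error,
                    evaluated here with a small Monte-Carlo budget `n_probe`
    keep_ratio    : fraction K of children retained under each selected parent
    Returns the surviving leaf labels; the full-budget evaluation runs on these.
    """
    frontier = ["root"]                    # the root's children are scored first
    survivors = []
    while frontier:
        next_frontier = []
        for parent in frontier:
            children = tree.get(parent, [])
            if not children:               # terminal category: keep as candidate leaf
                survivors.append(parent)
                continue
            # Rank this parent's children by their cheap Monte-Carlo error estimate.
            ranked = sorted(children, key=lambda c: denoise_error(x, c, n_probe))
            next_frontier.extend(ranked[:max(1, int(len(ranked) * keep_ratio))])
        frontier = next_frontier
    return survivors
```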
Computational gains: The standard flat method evaluates all $C$ candidate classes per input ($C{=}1000$ for ImageNet-1K). With leaf-based pruning, only a small fraction of the classes survives to the final full-budget evaluation (for a constant retention ratio $K$), with empirical reductions of 40-60% in denoising steps observed for ImageNet-1K at mild to aggressive pruning ratios (Shanbhag et al., 18 Nov 2024).
4. Preservation of Structure and Model Quality
- Tokenization determinism and reachability: Pruning is restricted to leaves, guaranteeing that all substrings previously covered remain tokenizable via existing or newly-exposed ancestor tokens. No internal node is eliminated before its children, and no unreachable tokens are introduced (in contrast to naïve or last-$N$ pruning). This ensures that all surviving tokens can be constructed from the atomic vocabulary set, preserving determinism (Purason et al., 3 Dec 2025). A small consistency check illustrating this guarantee is sketched after this list.
- Classification accuracy: In diffusion classifiers, leaf-based (hierarchical) pruning narrows the candidate evaluation set but maintains or improves top-$1$ accuracy at moderate pruning levels. For example, fixed $K{=}0.5$ pruning at each level yields top-$1$ accuracy of 65.16% (vs. 64.90% baseline) on ImageNet-1K, with a nearly 39% speedup. More aggressive ($2\sigma$-based) pruning still maintains competitive accuracy while accelerating inference by almost 60% (Shanbhag et al., 18 Nov 2024).
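This guarantee can be checked mechanically: every surviving merged token must have both of its components surviving, so that its merge rule can still fire and the token remains derivable from the atomic set. A minimal check, reusing the `components` mapping built in the pruning sketch of Section 3.1, might look as follows (illustrative only).

```python
def unreachable_tokens(surviving, components, atomic):
    """Return surviving merged tokens whose components were pruned away.

    surviving  : set of tokens kept after pruning
    components : dict merged token -> its (left, right) sub-tokens
    atomic     : tokens never produced by a merge (always derivable)
    Leaf-based pruning yields an empty result; naïve frequency or last-N
    pruning can leave thousands of such orphaned tokens.
    """
    return {tok for tok in surviving
            if tok not in atomic
            and any(sub not in surviving for sub in components[tok])}
```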
5. Empirical Results and Comparative Performance
5.1. Tokenizer Vocabulary Pruning
| Method | Bytes-per-token (Estonian) | Bytes-per-token (English) | Unreachable tokens |
|---|---|---|---|
| Frequency Leaf Pruning | Matches/outperforms baselines | Minimal degradation | 0 |
| Naïve Frequency/Last-N | Inferior | Inferior | Thousands (high pruning rates) |
Frequency-leaf pruning enables up to a 62.5% reduction in vocabulary size (e.g., removing ≈80k tokens from Llama-3.1-8B's ≈128k-token vocabulary) without significant loss on machine translation, commonsense QA, topic classification, or causal reasoning tasks (Purason et al., 3 Dec 2025).
5.2. Diffusion Classifier Hierarchical Pruning
| Method | Top-1 Accuracy [%] | Inference Time [s] | Speed-Up [%] |
|---|---|---|---|
| Baseline | 64.90 | 1600 | – |
| HDC (K=0.5) | 65.16 | 980 | 38.75 |
| HDC (2σ strategy) | 63.33 | 650 | 59.38 |
The fixed-$K$ strategy ($K{=}0.5$) preserves or improves accuracy with a ~39% speed-up; the $2\sigma$ strategy provides almost a 60% speed-up with a ~1.6pp drop in accuracy. These gains are robust across Stable Diffusion 1.4/2.0/2.1 and different pruning granularities (Shanbhag et al., 18 Nov 2024).
6. Tuning and Trade-Offs
- Pruning ratio ($K$): A lower $K$ increases pruning aggressiveness, yielding faster inference or a smaller vocabulary at potential risk to accuracy or tokenization quality. A fixed ratio (such as the $K{=}0.5$ setting above) is empirically robust for many settings.
- Dynamic $\sigma$-based thresholds: Alternative thresholding, in which children within a chosen number of standard deviations of the best error are kept, enables adaptive pruning rates and fine-grained control over the accuracy/speed trade-off (a minimal sketch of this rule follows the list).
- Monte-Carlo budgets: Using a reduced sampling budget during the initial pruning phase, relative to the full budget reserved for the final decision, accelerates selection with negligible effect on final accuracy (Shanbhag et al., 18 Nov 2024).
- Corpus frequency estimation: For tokenizer pruning, accurate estimation of the leaf frequencies $f(t)$ is essential; using an in-domain, representative corpus is critical to minimize impact on deployed applications (Purason et al., 3 Dec 2025).
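As an illustration of the dynamic threshold mentioned above, the sketch below keeps the children whose estimated error lies within m standard deviations of the best error at a level; the function and its arguments are hypothetical, not an API from the cited work.

```python
import statistics

def sigma_threshold_select(errors, m=2.0):
    """Keep children whose error is within m standard deviations of the best.

    errors : dict child label -> mean Monte-Carlo denoising error at this level
    m      : threshold width in standard deviations (m=2 corresponds to the
             2-sigma strategy reported above)
    """
    best = min(errors.values())
    spread = statistics.pstdev(errors.values()) if len(errors) > 1 else 0.0
    cutoff = best + m * spread
    return [label for label, err in errors.items() if err <= cutoff]
```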
A plausible implication is that leaf-based vocabulary pruning offers a principled, structure-aware means of adapting both label vocabularies and token sets for scalable, efficient inference in large models, without sacrificing accuracy or essential task coverage.
7. Broader Applications and Significance
Leaf-based vocabulary pruning generalizes to any setting in which the underlying combinatorial structure of the vocabulary is interpretable as a DAG or tree, and where maintaining coverage, determinism, and efficiency is desirable. Its application to pre-trained tokenizer adaptation and diffusion classifier acceleration illustrates its utility for both generative and discriminative model families. Furthermore, the structural guarantees—no unreachable tokens, no broken derivation chains—suggest that this methodology can be deployed safely in multilingual, domain-adaptive, or resource-constrained environments (Purason et al., 3 Dec 2025, Shanbhag et al., 18 Nov 2024).