SuperBPE: Space Travel for Language Models (2503.13423v2)

Published 17 Mar 2025 in cs.CL and cs.LG

Abstract: The assumption across nearly all language model (LM) tokenization schemes is that tokens should be subwords, i.e., contained within word boundaries. While providing a seemingly reasonable inductive bias, is this common practice limiting the potential of modern LMs? Whitespace is not a reliable delimiter of meaning, as evidenced by multi-word expressions (e.g., "by the way"), crosslingual variation in the number of words needed to express a concept (e.g., "spacesuit helmet" in German is "raumanzughelm"), and languages that do not use whitespace at all (e.g., Chinese). To explore the potential of tokenization beyond subwords, we introduce a "superword" tokenizer, SuperBPE, which incorporates a simple pretokenization curriculum into the byte-pair encoding (BPE) algorithm to first learn subwords, then superwords that bridge whitespace. This brings dramatic improvements in encoding efficiency: when fixing the vocabulary size to 200k, SuperBPE encodes a fixed piece of text with up to 33% fewer tokens than BPE on average. In experiments, we pretrain 8B transformer LMs from scratch while fixing the model size, vocabulary size, and train compute, varying only the algorithm for learning the vocabulary. Our model trained with SuperBPE achieves an average +4.0% absolute improvement over the BPE baseline across 30 downstream tasks (including +8.2% on MMLU), while simultaneously requiring 27% less compute at inference time. In analysis, we find that SuperBPE results in segmentations of text that are more uniform in per-token difficulty. Qualitatively, this may be because SuperBPE tokens often capture common multi-word expressions that function semantically as a single unit. SuperBPE is a straightforward, local modification to tokenization that improves both encoding efficiency and downstream performance, yielding better LLMs overall.

SuperBPE is a tokenization algorithm presented in "SuperBPE: Space Travel for Language Models" (Liu et al., 17 Mar 2025) that modifies the standard Byte-Pair Encoding (BPE) process to allow "superword" tokens, i.e., tokens whose text can span whitespace boundaries. This contrasts with conventional BPE implementations, which typically enforce the constraint that tokens be subwords contained within whitespace-delimited word boundaries. The core innovation is a two-stage pretokenization curriculum during vocabulary construction.

Motivation

The development of SuperBPE stems from challenging the implicit assumption in most tokenization schemes that subword units, confined by whitespace, constitute the optimal atomic units for LLMs. The authors posit several limitations to this subword constraint:

  1. Semantic Units: Whitespace is often an inadequate delimiter for capturing complete semantic units. Multi-word expressions (MWEs), such as "by the way" or "New York City", function as single concepts but are fragmented by standard tokenizers.
  2. Cross-lingual Discrepancies: The granularity of concepts expressed per word varies across languages. For instance, the German "Raumanzughelm" corresponds to the multi-word English phrase "spacesuit helmet". Subword tokenizers might struggle to represent such cross-lingual equivalences efficiently.
  3. Whitespace-Agnostic Languages: Languages like Chinese, Japanese, and Thai do not rely on whitespace to delimit words. Tokenization methods for these languages necessarily operate without this constraint, successfully learning tokens that correspond to multi-character sequences representing words or concepts, suggesting that the whitespace constraint is not universally necessary or beneficial.

These observations led to the hypothesis that enabling tokens to bridge whitespace could yield more semantically meaningful units, improve encoding efficiency (reducing sequence lengths), lower computational demands, and ultimately enhance LLM performance.

Methodology

SuperBPE introduces a modification to the vocabulary learning phase of BPE by incorporating a pretokenization curriculum. Standard BPE typically involves:

  1. Initializing the vocabulary with individual bytes.
  2. Using whitespace pretokenization to split the training corpus into word-like units (a minimal pretokenizer sketch follows this list).
  3. Iteratively merging the most frequent adjacent token pairs within these units.
  4. Adding the new merged token to the vocabulary until a target size T is reached.
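
As a minimal illustration of the pretokenization in step 2 (the regex below is a simplified stand-in; production tokenizers use more elaborate pretokenization patterns, such as the GPT-2 pattern):

import re

# Simplified whitespace pretokenizer: each unit keeps its leading space, and
# standard BPE merges never cross the boundaries between these units.
def pretokenize(text):
    return re.findall(r"\s?\S+", text)

print(pretokenize("by the way"))  # ['by', ' the', ' way']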

SuperBPE alters this process with a two-stage approach controlled by a transition point hyperparameter t (where 0 <= t <= T):

  1. Stage 1 (Subword Learning, 0 to t): The algorithm operates like standard BPE with whitespace pretokenization enabled. It learns merges and builds the vocabulary up to size t. This phase primarily captures frequent subword units.
  2. Stage 2 (Superword Learning, t to T): Starting from the vocabulary of size t learned in Stage 1, the algorithm continues the merging process, but disables whitespace pretokenization. This allows the BPE algorithm to consider and perform merges between adjacent tokens even if they are separated by whitespace in the original text (e.g., merging "by" and "the" if "by the" is frequent). This stage continues until the final target vocabulary size T is reached, adding superword tokens that span whitespace.

Setting t = T recovers standard BPE, while t = 0 implements BPE without any whitespace pretokenization from the start (which the paper shows performs suboptimally compared to the two-stage approach). The choice of t allows balancing the learning of subword and superword units.
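
For intuition, the difference shows up directly in how text is segmented. The token boundaries below are hypothetical, chosen for illustration rather than taken from the released tokenizers:

text = "by the way, the weather is nice"

# Standard BPE: every token stays within a whitespace-delimited word.
bpe_tokens = ["by", " the", " way", ",", " the", " weather", " is", " nice"]

# SuperBPE: frequent multi-word spans can become single superword tokens.
superbpe_tokens = ["by the way", ",", " the weather", " is", " nice"]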

Experimental Setup

The effectiveness of SuperBPE was evaluated by pretraining 8B parameter Transformer LMs from scratch. Key aspects of the experimental design include:

  • Model Architecture & Training: Based on the OLMo-7B configuration and training regime, using a subset (~330B tokens) of the Dolma v1.7 dataset.
  • Controlled Comparison: To isolate the effect of the tokenizer, experiments strictly controlled for:
    • Model parameter count (8B parameters).
    • Vocabulary size (T = 200,000 for all tokenizers).
    • Total training compute (FLOPs). Models using more efficient tokenizers (like SuperBPE) processed the same amount of text information per context window but had shorter sequence lengths in tokens. Their training steps were adjusted downwards to match the total FLOPs of the BPE baseline.
  • Tokenizers: Compared standard BPE (t=T=200k) against SuperBPE with varying transition points (t = 80k, 160k, 180k).
  • Evaluation Metrics:
    • Downstream performance: Assessed on a diverse suite of 30 tasks covering reasoning, knowledge, coding, comprehension, etc.
    • Encoding efficiency: Measured in bytes per token (BPT).
    • Language modeling quality: Measured using bits per byte (BPB); both BPT and BPB are spelled out in the sketch after this list.
    • Inference efficiency: Measured in FLOPs required to process a fixed amount of text.
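
The two efficiency and quality metrics can be written out as follows (standard formulations, stated here for concreteness rather than taken from the paper's code):

import math

def bytes_per_token(text, tokens):
    # Encoding efficiency: average number of UTF-8 bytes covered by one token.
    return len(text.encode("utf-8")) / len(tokens)

def bits_per_byte(total_loss_nats, num_bytes):
    # Language-modeling quality: total token-level cross-entropy (in nats),
    # renormalized by byte count so that tokenizers of different granularity
    # are directly comparable.
    return total_loss_nats / (num_bytes * math.log(2))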

Results

The experiments demonstrated significant advantages for SuperBPE:

  1. Encoding Efficiency: SuperBPE achieved markedly higher encoding efficiency. With a 200k vocabulary, the best SuperBPE variant (t=80k) reached 6.63 BPT compared to 4.45 BPT for standard BPE, i.e., up to 33% fewer tokens for the same text (a quick check of this figure follows the list). Encoding efficiency also continued to improve with larger vocabularies for SuperBPE, unlike BPE, whose efficiency plateaued earlier.
  2. Downstream Performance: The 8B LM trained with SuperBPE (t=180k, which proved optimal for downstream tasks despite a slightly lower BPT than t=80k) showed substantial improvements over the 8B BPE baseline, achieving an average +4.0% absolute gain across the 30 evaluated tasks. Notably strong improvements were observed on knowledge-intensive and commonsense reasoning tasks, including +8.2% on MMLU, +21.2% on OpenbookQA, and +20.3% on CommonsenseQA. The SuperBPE model outperformed the baseline on 25 of the 30 tasks.
  3. Inference Compute Reduction: Due to the shorter sequence lengths resulting from higher encoding efficiency, the 8B SuperBPE models required 27-33% fewer FLOPs during inference compared to the 8B BPE model when processing equivalent amounts of input text.
  4. Language modeling (BPB): An interesting finding was that the SuperBPE model (t=180k) that performed best downstream exhibited slightly worse (higher) BPB than the BPE baseline. An 11B SuperBPE model, scaled to match the inference FLOPs of the 8B BPE model, achieved better BPB than both 8B models. This suggests that, under fixed compute and parameter constraints, BPB may not perfectly correlate with downstream utility when comparing fundamentally different tokenization schemes such as BPE and SuperBPE.
  5. Token Difficulty Analysis: Analysis of per-token prediction loss revealed that SuperBPE models have a more uniform distribution of difficulty. Compared to BPE, SuperBPE has fewer tokens with very low loss (presumably common subwords now merged into larger, slightly harder-to-predict superwords) and, crucially, fewer tokens with very high loss. This reduction in worst-case prediction difficulty might contribute to the improved downstream performance. Qualitative analysis showed SuperBPE tokens often correspond to MWEs (e.g., "such as the", "of course ,").
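
A quick check of the headline efficiency number, using the BPT values reported above:

bpt_bpe, bpt_superbpe = 4.45, 6.63
token_reduction = 1 - bpt_bpe / bpt_superbpe
print(f"{token_reduction:.0%}")  # ~33% fewer tokens for the same text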

Implementation Considerations

Implementing SuperBPE involves modifying the BPE vocabulary training script. Most existing BPE implementations (such as SentencePiece or the Hugging Face tokenizers library) apply pretokenization rules before the iterative merging process.

  1. Two-Stage Training: The core change is to run the BPE training process in two stages.
    • Stage 1: Use standard whitespace pretokenization and train until the vocabulary reaches size t. Save this intermediate vocabulary and merge rules.
    • Stage 2: Reload the vocabulary and merge rules from Stage 1. Continue training from this state up to the final size T, but disable whitespace pretokenization for this stage. This typically involves changing a configuration flag or removing the pretokenizer component for the second stage.
  2. Hyperparameter t: The transition point t needs to be chosen. The paper found t=180k (for T=200k) optimal for downstream tasks with 8B models, suggesting that a significant portion of the vocabulary should still be dedicated to subwords before allowing superword merges. Optimal t might depend on T, the model size, and the training data.
  3. Compatibility: SuperBPE requires no changes to the model architecture or the inference decoding logic. Existing Transformer implementations can use a SuperBPE tokenizer seamlessly. The primary difference manifests as shorter input/output sequences for the same text content.
  4. Computational Cost (Training): Training a SuperBPE vocabulary involves the same algorithmic complexity as standard BPE, requiring negligible additional compute compared to the cost of pretraining the LM itself.
  5. Computational Cost (Inference): The main benefit is reduced inference cost. Since the token sequence length L is shorter for SuperBPE than for BPE on the same text (L_SuperBPE < L_BPE), the cost of the attention layers (quadratic in L) and the MLP layers (linear in L) is significantly reduced; a rough scaling sketch follows this list.
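
As a rough illustration of that scaling (the hidden size and the 4x MLP expansion below are assumed, illustrative constants, not the paper's FLOPs accounting):

# Rough per-sequence forward-pass cost as a function of token sequence length L,
# for an assumed hidden size d and a 4x MLP expansion (illustrative only).
def relative_cost(L, d=4096):
    attention = L * L * d   # attention score / value mixing: quadratic in L
    mlp = L * 8 * d * d     # two 4x-expansion projections: linear in L
    return attention + mlp

# A sequence that is ~33% shorter in tokens for the same text reduces both terms.
print(relative_cost(2700) / relative_cost(4096))  # well below 1.0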

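The sketch below fills in the two-stage training driver as a toy, character-level BPE trainer. The toy corpus, the character-level base vocabulary, and the tiny vocabulary sizes are illustrative stand-ins: the paper's tokenizers operate over bytes on large pretraining corpora with T = 200k, typically by modifying an existing BPE training implementation rather than writing one from scratch.
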
import collections

def apply_merge(seq, left, right):
    # Replace every adjacent (left, right) pair in seq with the merged token.
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and seq[i] == left and seq[i + 1] == right:
            out.append(left + right)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def train_bpe_stage(corpus, vocab, merges, target_vocab_size, use_whitespace_pretok=True):
    # Grow the vocabulary with BPE merges until it reaches target_vocab_size.
    # use_whitespace_pretok=True: split the corpus into words so merges never
    # cross word boundaries (standard BPE / SuperBPE stage 1).
    # use_whitespace_pretok=False: merge over whole documents, so tokens may
    # bridge whitespace (SuperBPE stage 2).
    units = [w for text in corpus for w in text.split()] if use_whitespace_pretok else list(corpus)
    seqs = [list(u) for u in units]            # start from characters (a stand-in for bytes)
    vocab, merges = set(vocab), list(merges)   # copy so each stage returns its own snapshot
    for left, right in merges:                 # replay merge rules learned in earlier stages
        seqs = [apply_merge(s, left, right) for s in seqs]
    while len(vocab) < target_vocab_size:
        pair_counts = collections.Counter()
        for s in seqs:
            pair_counts.update(zip(s, s[1:]))
        if not pair_counts:
            break                              # nothing left to merge in this tiny corpus
        (left, right), _ = pair_counts.most_common(1)[0]
        merges.append((left, right))
        vocab.add(left + right)
        seqs = [apply_merge(s, left, right) for s in seqs]
    return vocab, merges

# Toy corpus; the paper trains on large pretraining data with a byte-level base vocabulary.
corpus = ["by the way , this is only a toy corpus .",
          "by the way , superword tokens may span whitespace ."]
T = 120   # final vocab size (200,000 in the paper)
t = 100   # transition point (180,000 in the paper)

initial_vocab = {ch for text in corpus for ch in text}   # character-level stand-in for bytes

# Stage 1: standard BPE with whitespace pretokenization, up to vocabulary size t.
vocab_stage1, merges_stage1 = train_bpe_stage(
    corpus, initial_vocab, merges=[], target_vocab_size=t, use_whitespace_pretok=True
)

# Stage 2: continue from the stage-1 state with whitespace pretokenization disabled, up to size T.
vocab_final, merges_final = train_bpe_stage(
    corpus, vocab_stage1, merges_stage1, target_vocab_size=T, use_whitespace_pretok=False
)

print(sorted(tok for tok in vocab_final if " " in tok))   # superword tokens that bridge whitespace

In conclusion, SuperBPE offers a simple yet effective modification to BPE tokenization. By relaxing the strict subword constraint via a controlled two-stage learning process, it achieves substantially better encoding efficiency, leading to significant reductions in inference compute requirements and notable improvements in downstream task performance for LLMs trained with fixed resources.

Authors (6)
  1. Alisa Liu (25 papers)
  2. Jonathan Hayase (20 papers)
  3. Valentin Hofmann (21 papers)
  4. Sewoong Oh (128 papers)
  5. Noah A. Smith (224 papers)
  6. Yejin Choi (287 papers)