Bits Per Token (BPT) Overview
- Bits Per Token (BPT) is a quantitative metric that measures the average number of information bits per token, influencing model efficiency and compression.
- BPT enables comparisons among byte-level, bit-level, and numerical tokenization schemes, which trade off vocabulary size, sequence length, and computational cost.
- Practical applications include optimizing LLM architectures for numeracy and multilingual tasks, with empirical results showing improved inference speed and reduced token count.
Bits Per Token (BPT) is a quantitative metric that measures the average number of information bits—either raw or semantic—conveyed by a single token in an LLM’s tokenization scheme. BPT serves as a critical lens for evaluating both the efficiency of different tokenization strategies and their impact on the computational and representational properties of LLMs. BPT is especially relevant when considering lossless compression, out-of-vocabulary (OOV) robustness, arithmetic fidelity, and trade-offs between vocabulary size, sequence length, and model compute cost. Recent research formalizes precise definitions, provides closed-form formulas, outlines direct empirical comparisons, and details architectural innovations that seek to optimize BPT in contexts ranging from general multilingual text to numerically intensive tasks (Moon et al., 9 Jun 2025, Kreitner et al., 8 Oct 2025).
1. Formal Definition of Bits Per Token
BPT is formally defined as the ratio of the total information bits used to encode an input to the number of tokens emitted during tokenization:

$$\mathrm{BPT} = \frac{B_{\text{total}}}{N_{\text{tokens}}}$$

This definition generalizes across tokenization schemes, encoding modalities, and semantic targets. In byte-level BPE (Byte-Pair Encoding), each token is precisely one byte, yielding $\mathrm{BPT} = 8$ bits/token. For advanced schemes such as BitTokens, which encode IEEE 754 double-precision numbers into single tokens, BPT can reach 64 or even 128 bits/token, commensurate with the precision and dimensionality of the embedded number representation (Moon et al., 9 Jun 2025, Kreitner et al., 8 Oct 2025).
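As a concrete illustration of the definition, the following minimal Python sketch computes BPT directly from the ratio; it is not code from the cited papers, and the helper name and example string are illustrative only.

```python
def bits_per_token(total_bits: int, num_tokens: int) -> float:
    """BPT = total information bits / number of tokens emitted."""
    return total_bits / num_tokens

# Byte-level BPE fallback: every token is exactly one UTF-8 byte.
text = "tokenization"
byte_tokens = list(text.encode("utf-8"))       # one token per byte
print(bits_per_token(8 * len(byte_tokens), len(byte_tokens)))  # -> 8.0

# BitTokens (conceptually): one IEEE 754 double carried by a single token.
print(bits_per_token(64, 1))                   # -> 64.0 bits/token
```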
2. BPT in Byte-Level and Bit-Level Tokenization Schemes
Byte-level fallback is a prevalent solution for eliminating OOV errors in LLMs—by decomposing Unicode characters into UTF-8 byte sequences, each token captures 8 bits. However, this increases sequence length for languages with multi-byte characters (e.g., Chinese, Japanese, Korean [CJK]). Bit-level BPE addresses this by further compressing UTF-8 via prefix extraction and widened trailing-bit tokens:
- Shared prefix width ($w_p$): the bit width of the common UTF-8 prefix factored out as a run-length token.
- Trailing subword width ($w_t$): the bit width of each widened trailing-bit token.
Writing $N_p$ for the prefix token count and $N$ for the total token count, the BPT of bit-level BPE is the weighted average $\mathrm{BPT} = \frac{w_p N_p + w_t (N - N_p)}{N}$, and is therefore bounded by the two widths: $\min(w_p, w_t) \le \mathrm{BPT} \le \max(w_p, w_t)$. Average values in CJK corpora are empirically contained in $8.2$–$8.3$ bits/token, exceeding byte-level BPE due to run-length prefix reuse (Moon et al., 9 Jun 2025).
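The sequence-length inflation that motivates bit-level BPE can be seen with plain UTF-8 byte counting. The sketch below shows only the byte-level baseline (the sample strings are arbitrary, and the paper's prefix/trailing widths are not reproduced here); bit-level BPE then recovers part of this overhead by factoring the shared prefix bits of consecutive multi-byte characters into a single run-length token.

```python
# Under byte-level fallback, each UTF-8 byte becomes one 8-bit token, so a
# 3-byte CJK character costs 3 tokens while an ASCII character costs 1.
samples = {"English": "byte pair", "Chinese": "字节对编码", "Korean": "바이트 쌍"}

for lang, text in samples.items():
    n_tokens = len(text.encode("utf-8"))        # byte-level token count
    print(f"{lang}: {len(text)} chars -> {n_tokens} tokens "
          f"({8 * n_tokens} bits, {n_tokens / len(text):.2f} tokens/char)")
```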
3. BPT in Numeric Representation and BitTokens
Conventional string or subword tokenizations of numbers are inefficient: a double-precision number (64 bits) often spans multiple tokens, with each token carrying at most $\log_2 |V|$ bits, where $|V|$ is the tokenizer vocabulary size. Thus, the semantic BPT for numerals is typically in the 8–16 bits/token range.
BitTokens encode each real number as a single token using its full IEEE 754 binary representation (1 sign bit, 11 exponent bits, 52 fraction bits), leading to $\mathrm{BPT} = 64$ bits/token. With extended encoding (including the reciprocal $1/x$), inputs reach 128 bits, i.e., $\mathrm{BPT} = 128$ bits/token (Kreitner et al., 8 Oct 2025).
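The 64-bit payload of a single BitToken corresponds to the standard IEEE 754 field layout. The following sketch uses Python's standard `struct` module to extract the three fields; it illustrates the bit budget only and is not the authors' implementation.

```python
import struct

def double_to_fields(x: float) -> tuple[int, int, int]:
    """Split an IEEE 754 double into its (sign, exponent, fraction) bit fields."""
    raw = struct.unpack(">Q", struct.pack(">d", x))[0]  # the 64-bit pattern
    sign = raw >> 63                   # 1 bit
    exponent = (raw >> 52) & 0x7FF     # 11 bits
    fraction = raw & ((1 << 52) - 1)   # 52 bits
    return sign, exponent, fraction

print(double_to_fields(3.141592653589793))
# All 64 bits ride on one token (BPT = 64); the extended variant that also
# packs the reciprocal 1/x doubles this to 128 bits per token.
```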
BPT Comparison Table
| Scheme | BPT (bits/token) | Semantic Context |
|---|---|---|
| Byte-level BPE | 8 | UTF-8 bytes/OOV fallback |
| Bit-level BPE | 6–9 (avg 8.3) | CJK/emoji/non-ASCII |
| BitTokens | 64 (or 128) | IEEE 754 numbers |
| Subword BPE | ~8–16 | Float/digit sequences |
4. Practical Impact and Empirical Observations
An increase in BPT corresponds to shorter sequence lengths for information-equivalent encoding, directly reducing transformer compute and memory bandwidth. For CJK-heavy datasets, bit-level BPE achieves:
- 3.13% shorter sequences in Chinese
- 0.83% in Japanese
- 2.21% in Korean
The bits-per-character (BPC) metric indicates savings of $0.2$–$0.8$ bits/char, or a 3%–6% relative reduction. On numeracy benchmarks, BitTokens yield 4×–8× BPT improvements over subword BPE and 8×–16× over byte-level BPE, enabling even small LLMs (e.g., 117M parameters) to learn addition, multiplication, and division nearly perfectly in multi-task settings, as measured by log-sMAPE (Moon et al., 9 Jun 2025, Kreitner et al., 8 Oct 2025). Concretely:
- Single-token number encoding allows arithmetic tasks with 6–7 tokens per example (vs. 30–60 for BPE).
- Wall-clock inference speed improves markedly once reduced sequence lengths and "perceived TPS" (tokens/sec) are taken into account, since token count is the dominant bottleneck for long-tail or numeric inputs (see the sketch below).
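As a rough illustration of why token count dominates cost, the back-of-the-envelope sketch below assumes a simplified quadratic attention-cost model and picks 45 vs. 7 tokens per example from the ranges quoted above; both assumptions are for illustration only.

```python
# Self-attention cost grows roughly quadratically with sequence length, so the
# shorter BitTokens encoding of an arithmetic example pays off super-linearly.
def relative_attention_cost(seq_len: int) -> int:
    return seq_len ** 2                # simplified O(n^2) cost model

bpe_len, bittoken_len = 45, 7          # assumed per-example token counts
speedup = relative_attention_cost(bpe_len) / relative_attention_cost(bittoken_len)
print(f"{bpe_len} vs {bittoken_len} tokens -> ~{speedup:.0f}x fewer attention FLOPs")
```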
5. Algorithmic and Architectural Considerations
Tokenization strategies optimized for BPT entail architectural changes:
- Bit-level BPE requires maintaining run-length encoded prefixes, splitting streams into nonuniform bit tokens, and precise bitwise manipulations. This necessitates augmented encoding/decoding logic and careful embedding table design for the expanded vocabulary (more entries than the 256 byte tokens of plain byte-level fallback).
- BitTokens embed bitstrings as sign-mantissa-exponent vectors, requiring a specialized number head and regex-driven serialization, along with normalization and careful handling of IEEE 754 corner cases (e.g., 0, ±∞, NaN); a minimal sketch follows this list.
- For both, retrofitting pre-trained LLMs with new token types may induce under-training and catastrophic forgetting unless meticulously initialized.
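A minimal sketch of how a 64-bit pattern could enter the model as a single token embedding follows; the ±1 bit-vector input, the projection named `bit_proj`, and the model width are assumptions, and the actual BitTokens number head, serialization, and normalization may differ.

```python
import struct
import torch

D_MODEL = 256  # assumed model width, for illustration only

def bit_vector(x: float) -> torch.Tensor:
    """Map a double's 64-bit IEEE 754 pattern to a {-1, +1} vector of length 64."""
    raw = struct.unpack(">Q", struct.pack(">d", x))[0]
    bits = [(raw >> i) & 1 for i in range(63, -1, -1)]
    return torch.tensor(bits, dtype=torch.float32) * 2 - 1

# One learned projection maps the full bit pattern to a single token embedding,
# so the number occupies exactly one sequence position instead of a digit run.
bit_proj = torch.nn.Linear(64, D_MODEL)
number_embedding = bit_proj(bit_vector(2.718281828459045))
print(number_embedding.shape)  # torch.Size([256])
```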
6. Limitations and Trade-offs
Optimizing for high or low BPT introduces multiple trade-offs:
- Expanded vocabularies inflate embedding/softmax size and may lower the Rényi efficiency of the token distribution, since many added tokens are rarely used (see the sketch after this list).
- Encoder/decoder (de)serialization becomes more complex; although the runtime overhead is modest, it raises integration barriers for existing pre-trained architectures.
- For BitTokens, numerical robustness is high, but discontinuities in bitwise loss and lack of support for scientific notation or rare IEEE 754 patterns can yield decoding or learning pathologies.
- In multi-step or compositional numeracy tasks, bottlenecks often shift from representation (BPT) to model algorithmic capacity, as shown by partial failures on mean/std tasks even with high BPT (Kreitner et al., 8 Oct 2025).
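To make the Rényi-efficiency trade-off concrete, the small sketch below computes the Rényi entropy of a unigram token distribution normalized by $\log_2 |V|$; the order $\alpha = 2.5$, the normalization choice, and the toy token counts are assumptions of this sketch rather than values from the cited papers.

```python
import numpy as np

def renyi_efficiency(counts, alpha: float = 2.5) -> float:
    """Rényi entropy of the unigram token distribution, normalized by log2 |V|."""
    p = np.asarray(counts, dtype=np.float64)
    p = p / p.sum()
    h_alpha = np.log2(np.sum(p ** alpha)) / (1.0 - alpha)
    return h_alpha / np.log2(len(p))

balanced = [100] * 256               # compact vocabulary, evenly used
expanded = [100] * 256 + [1] * 1024  # expanded vocabulary, most new tokens rare
print(renyi_efficiency(balanced))    # 1.0
print(renyi_efficiency(expanded))    # < 1.0: rarely used tokens depress efficiency
```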
7. Theoretical and Empirical Significance
High BPT, as achieved by BitTokens, is directly linked to increased representational efficiency, lower token requirements for numerically intensive tasks, and improved throughput. Low BPT, as in byte-level or subword schemes, facilitates fine-grained language representation and OOV resilience but may increase processing overhead for long-tail character sets or numbers. BPT thus offers a rigorous, model-agnostic framework for quantifying and optimizing the information-theoretic profile of LLM tokenizers, with immediate consequences for resource allocation, training cost, and downstream accuracy in diverse textual and numeric domains (Moon et al., 9 Jun 2025, Kreitner et al., 8 Oct 2025).