Bits Per Token (BPT) Overview
- Bits Per Token (BPT) is a quantitative metric that measures the average number of information bits per token, influencing model efficiency and compression.
- BPT enables comparisons among byte-level, bit-level, and numerical tokenization schemes, which trade off vocabulary size, sequence length, and computational cost.
- Practical applications include optimizing LLM architectures for numeracy and multilingual tasks, with empirical results showing improved inference speed and reduced token count.
Bits Per Token (BPT) is a quantitative metric that measures the average number of information bits—either raw or semantic—conveyed by a single token in an LLM’s tokenization scheme. BPT serves as a critical lens for evaluating both the efficiency of different tokenization strategies and their impact on the computational and representational properties of LLMs. BPT is especially relevant when considering lossless compression, out-of-vocabulary (OOV) robustness, arithmetic fidelity, and trade-offs between vocabulary size, sequence length, and model compute cost. Recent research formalizes precise definitions, provides closed-form formulas, outlines direct empirical comparisons, and details architectural innovations that seek to optimize BPT in contexts ranging from general multilingual text to numerically intensive tasks (Moon et al., 9 Jun 2025, Kreitner et al., 8 Oct 2025).
1. Formal Definition of Bits Per Token
BPT is formally defined as the ratio of the total information bits used to encode an input to the number of tokens emitted during tokenization:

$$\mathrm{BPT} = \frac{B_{\text{total}}}{N_{\text{tokens}}}$$

This definition generalizes across tokenization schemes, encoding modalities, and semantic targets. In byte-level BPE (Byte-Pair Encoding), each token is precisely one byte, yielding $\mathrm{BPT} = 8$ bits/token. For advanced schemes such as BitTokens, which encode IEEE 754 double-precision numbers into single tokens, BPT can reach 64 or even 128 bits/token, commensurate with the precision and dimensionality of the embedded number representation (Moon et al., 9 Jun 2025, Kreitner et al., 8 Oct 2025).
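As a concrete illustration of the definition, the following minimal Python sketch computes BPT directly from the ratio; it is not code from the cited papers, and the helper name and example string are illustrative only.

```python
def bits_per_token(total_bits: int, num_tokens: int) -> float:
    """BPT = total information bits / number of tokens emitted."""
    return total_bits / num_tokens

# Byte-level BPE fallback: every token is exactly one UTF-8 byte.
text = "tokenization"
byte_tokens = list(text.encode("utf-8"))       # one token per byte
print(bits_per_token(8 * len(byte_tokens), len(byte_tokens)))  # -> 8.0

# BitTokens (conceptually): one IEEE 754 double carried by a single token.
print(bits_per_token(64, 1))                   # -> 64.0 bits/token
```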
2. BPT in Byte-Level and Bit-Level Tokenization Schemes
Byte-level fallback is a prevalent solution for eliminating OOV errors in LLMs—by decomposing Unicode characters into UTF-8 byte sequences, each token captures 8 bits. However, this increases sequence length for languages with multi-byte characters (e.g., Chinese, Japanese, Korean [CJK]). Bit-level BPE addresses this by further compressing UTF-8 via prefix extraction and widened trailing-bit tokens:
- Shared prefix width ($w_p$): the bit width of the common UTF-8 prefix factored out as a run-length token.
- Trailing subword width ($w_t$): the bit width of each widened trailing-bit token.
Writing $N_p$ for the prefix token count and $N$ for the total token count, the BPT of bit-level BPE is the weighted average $\mathrm{BPT} = \frac{w_p N_p + w_t (N - N_p)}{N}$, and is therefore bounded by the two widths: $\min(w_p, w_t) \le \mathrm{BPT} \le \max(w_p, w_t)$. Average values in CJK corpora are empirically contained in $8.2$–$8.3$ bits/token, exceeding byte-level BPE due to run-length prefix reuse (Moon et al., 9 Jun 2025).
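The sequence-length inflation that motivates bit-level BPE can be seen with plain UTF-8 byte counting. The sketch below shows only the byte-level baseline (the sample strings are arbitrary, and the paper's prefix/trailing widths are not reproduced here); bit-level BPE then recovers part of this overhead by factoring the shared prefix bits of consecutive multi-byte characters into a single run-length token.

```python
# Under byte-level fallback, each UTF-8 byte becomes one 8-bit token, so a
# 3-byte CJK character costs 3 tokens while an ASCII character costs 1.
samples = {"English": "byte pair", "Chinese": "字节对编码", "Korean": "바이트 쌍"}

for lang, text in samples.items():
    n_tokens = len(text.encode("utf-8"))        # byte-level token count
    print(f"{lang}: {len(text)} chars -> {n_tokens} tokens "
          f"({8 * n_tokens} bits, {n_tokens / len(text):.2f} tokens/char)")
```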
3. BPT in Numeric Representation and BitTokens
Conventional string or subword tokenizations of numbers are inefficient: a double-precision number (64 bits) often spans multiple tokens, with each token carrying at most $\log_2 |V|$ bits, where $|V|$ is the tokenizer vocabulary size. Thus, the semantic BPT for numerals is typically in the 8–16 bits/token range.
BitTokens encode each real number as a single token using its full IEEE 754 binary representation (1 sign bit, 11 exponent bits, 52 fraction bits), leading to $\mathrm{BPT} = 64$ bits/token. With extended encoding (including the reciprocal $1/x$), inputs reach 128 bits, i.e., $\mathrm{BPT} = 128$ bits/token (Kreitner et al., 8 Oct 2025).
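The 64-bit payload of a single BitToken corresponds to the standard IEEE 754 field layout. The following sketch uses Python's standard `struct` module to extract the three fields; it illustrates the bit budget only and is not the authors' implementation.

```python
import struct

def double_to_fields(x: float) -> tuple[int, int, int]:
    """Split an IEEE 754 double into its (sign, exponent, fraction) bit fields."""
    raw = struct.unpack(">Q", struct.pack(">d", x))[0]  # the 64-bit pattern
    sign = raw >> 63                   # 1 bit
    exponent = (raw >> 52) & 0x7FF     # 11 bits
    fraction = raw & ((1 << 52) - 1)   # 52 bits
    return sign, exponent, fraction

print(double_to_fields(3.141592653589793))
# All 64 bits ride on one token (BPT = 64); the extended variant that also
# packs the reciprocal 1/x doubles this to 128 bits per token.
```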
BPT Comparison Table
| Scheme | BPT (bits/token) | Semantic Context |
|---|---|---|
| Byte-level BPE | 8 | UTF-8 bytes/OOV fallback |
| Bit-level BPE | 6–9 (avg 8.3) | CJK/emoji/non-ASCII |
| BitTokens | 64 (or 128) | IEEE 754 numbers |
| Subword BPE | ~8–16 | Float/digit sequences |
4. Practical Impact and Empirical Observations
An increase in BPT corresponds to shorter sequence lengths for information-equivalent encoding, directly reducing transformer compute and memory bandwidth. For CJK-heavy datasets, bit-level BPE achieves:
- 3.13% shorter sequences in Chinese
- 0.83% in Japanese
- 2.21% in Korean
The bits-per-character (BPC) metric indicates savings of $0.2$–$0.8$ bits/char, or a 3%–6% relative reduction. On numeracy benchmarks, BitTokens yield 4×–8× BPT improvements over subword BPE and 8×–16× over byte-level BPE, enabling even small LLMs (e.g., 117M parameters) to learn addition, multiplication, and division nearly perfectly in multi-task settings, as measured by log-sMAPE (Moon et al., 9 Jun 2025, Kreitner et al., 8 Oct 2025). Concretely:
- Single-token number encoding allows arithmetic tasks with 6–7 tokens per example (vs. 30–60 for BPE).
- Wall-clock inference speed improves markedly once reduced sequence lengths and "perceived TPS" (tokens/sec) are taken into account, since token count is the dominant bottleneck for long-tail or numeric inputs (see the sketch below).
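As a rough illustration of why token count dominates cost, the back-of-the-envelope sketch below assumes a simplified quadratic attention-cost model and picks 45 vs. 7 tokens per example from the ranges quoted above; both assumptions are for illustration only.

```python
# Self-attention cost grows roughly quadratically with sequence length, so the
# shorter BitTokens encoding of an arithmetic example pays off super-linearly.
def relative_attention_cost(seq_len: int) -> int:
    return seq_len ** 2                # simplified O(n^2) cost model

bpe_len, bittoken_len = 45, 7          # assumed per-example token counts
speedup = relative_attention_cost(bpe_len) / relative_attention_cost(bittoken_len)
print(f"{bpe_len} vs {bittoken_len} tokens -> ~{speedup:.0f}x fewer attention FLOPs")
```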
5. Algorithmic and Architectural Considerations
Tokenization strategies optimized for BPT entail architectural changes:
- Bit-level BPE requires maintaining run-length encoded prefixes, splitting streams into nonuniform bit tokens, and precise bitwise manipulations. This necessitates augmented encoding/decoding logic and careful embedding table design for the expanded vocabulary (more entries than the 256 byte tokens of plain byte-level fallback).
- BitTokens embed bitstrings as sign-mantissa-exponent vectors, requiring a specialized number head and regex-driven serialization, along with normalization and careful handling of IEEE 754 corner cases (e.g., 0, ±∞, NaN); a minimal sketch follows this list.
- For both, retrofitting pre-trained LLMs with new token types may induce under-training and catastrophic forgetting unless meticulously initialized.
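A minimal sketch of how a 64-bit pattern could enter the model as a single token embedding follows; the ±1 bit-vector input, the projection named `bit_proj`, and the model width are assumptions, and the actual BitTokens number head, serialization, and normalization may differ.

```python
import struct
import torch

D_MODEL = 256  # assumed model width, for illustration only

def bit_vector(x: float) -> torch.Tensor:
    """Map a double's 64-bit IEEE 754 pattern to a {-1, +1} vector of length 64."""
    raw = struct.unpack(">Q", struct.pack(">d", x))[0]
    bits = [(raw >> i) & 1 for i in range(63, -1, -1)]
    return torch.tensor(bits, dtype=torch.float32) * 2 - 1

# One learned projection maps the full bit pattern to a single token embedding,
# so the number occupies exactly one sequence position instead of a digit run.
bit_proj = torch.nn.Linear(64, D_MODEL)
number_embedding = bit_proj(bit_vector(2.718281828459045))
print(number_embedding.shape)  # torch.Size([256])
```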
6. Limitations and Trade-offs
Optimizing for high or low BPT introduces multiple trade-offs:
- Expanded vocabularies inflate embedding/softmax size and may lower the Rényi efficiency of the token distribution, since many added tokens are rarely used (see the sketch after this list).
- Encoder/decoder (de)serialization becomes more complex; although the runtime overhead is modest, it raises integration barriers for existing pre-trained architectures.
- For BitTokens, numerical robustness is high, but discontinuities in bitwise loss and lack of support for scientific notation or rare IEEE 754 patterns can yield decoding or learning pathologies.
- In multi-step or compositional numeracy tasks, bottlenecks often shift from representation (BPT) to model algorithmic capacity, as shown by partial failures on mean/std tasks even with high BPT (Kreitner et al., 8 Oct 2025).
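To make the Rényi-efficiency trade-off concrete, the small sketch below computes the Rényi entropy of a unigram token distribution normalized by $\log_2 |V|$; the order $\alpha = 2.5$, the normalization choice, and the toy token counts are assumptions of this sketch rather than values from the cited papers.

```python
import numpy as np

def renyi_efficiency(counts, alpha: float = 2.5) -> float:
    """Rényi entropy of the unigram token distribution, normalized by log2 |V|."""
    p = np.asarray(counts, dtype=np.float64)
    p = p / p.sum()
    h_alpha = np.log2(np.sum(p ** alpha)) / (1.0 - alpha)
    return h_alpha / np.log2(len(p))

balanced = [100] * 256               # compact vocabulary, evenly used
expanded = [100] * 256 + [1] * 1024  # expanded vocabulary, most new tokens rare
print(renyi_efficiency(balanced))    # 1.0
print(renyi_efficiency(expanded))    # < 1.0: rarely used tokens depress efficiency
```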
7. Theoretical and Empirical Significance
High BPT, as achieved by BitTokens, is directly linked to increased representational efficiency, lower token requirements for numerically intensive tasks, and improved throughput. Low BPT, as in byte-level or subword schemes, facilitates fine-grained language representation and OOV resilience but may increase processing overhead for long-tail character sets or numbers. BPT thus offers a rigorous, model-agnostic framework for quantifying and optimizing the information-theoretic profile of LLM tokenizers, with immediate consequences for resource allocation, training cost, and downstream accuracy in diverse textual and numeric domains (Moon et al., 9 Jun 2025, Kreitner et al., 8 Oct 2025).