Bits-Per-Byte (BPB): Compression & Tokenisation

Updated 21 August 2025
  • Bits-Per-Byte (BPB) is an information-theoretic metric that measures the average number of bits needed to encode each byte of data, reflecting coding efficiency and compression quality.
  • Information-driven tokenisation methods, like ByteSpan, leverage entropy and surprisal to group bytes into linguistically meaningful subwords, outperforming frequency-based approaches.
  • BPB efficiency in multilingual contexts is enhanced through tailored language quotas, ensuring balanced compression performance and improved morphological alignment across scripts.

Bits-Per-Byte (BPB) is a metric rooted in information theory that quantifies the average number of bits needed to represent each byte (or more generally, symbol) of a data stream or model output after compression or encoding. It serves as a practical indicator of coding efficiency and model predictability, and is increasingly used to evaluate tokenisation, data compression, and subword encoding schemes in both monolingual and multilingual contexts.
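
To make the definition concrete, BPB can be computed from a model's per-token losses by normalising the total information over the byte length of the underlying text. The sketch below is a minimal illustration of that conversion, assuming hypothetical inputs token_nll_nats (per-token negative log-likelihoods in nats) and text; it is not tied to any particular library or model.

```python
import math

def bits_per_byte(token_nll_nats: list[float], text: str) -> float:
    """Convert per-token negative log-likelihoods (in nats) to bits-per-byte.

    BPB = (total NLL in bits) / (number of UTF-8 bytes in the text),
    so a lower value means the byte stream is encoded more efficiently.
    """
    total_bits = sum(token_nll_nats) / math.log(2)  # nats -> bits
    num_bytes = len(text.encode("utf-8"))
    return total_bits / num_bytes

# Example: 4 tokens with ~2.8 nats of loss each over a 20-byte string
print(bits_per_byte([2.8, 2.8, 2.8, 2.8], "a twenty-byte string"))
```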

1. Information-Driven Tokenisation: Principles and Algorithmic Design

Traditional subword tokenisation methods, most notably Byte Pair Encoding (BPE), rely on iterative, frequency-based merging of byte or character pairs, greedily selecting the pair with the greatest frequency mass at each step. In contrast, ByteSpan introduces an information-driven grouping mechanism, operationalised with an external byte-level language model (LM) that computes the information content (e.g., entropy H(bₜ)) of each byte in the training corpus. Two main constraints are applied:

  • Global Constraint: Group bytes if H(bₜ) < τ_g, with τ_g a fixed threshold.
  • Monotonic Constraint: Group contiguous bytes as long as H(bₜ) − H(bₜ₋₁) < 0.

A combined criterion can also be employed, patching together contiguous spans of bytes that either fall below the information threshold or exhibit locally monotonic decreases in information. This information-driven grouping yields variable-length subword units (“spans”) that tend to map to predictable sections of the byte stream, naturally aligning with boundaries of low surprisal.
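
A minimal sketch of this grouping logic is shown below. It assumes a hypothetical entropies array holding the per-byte information values H(bₜ) produced by an external byte-level LM; the exact thresholds and tie-breaking rules in ByteSpan may differ.

```python
def group_spans(entropies: list[float], tau_g: float) -> list[tuple[int, int]]:
    """Segment a byte stream into spans using the combined criterion.

    A byte is merged into the current span if its information content is
    below the global threshold tau_g, OR if information is still decreasing
    relative to the previous byte (the monotonic constraint). Otherwise a
    new span starts at that byte.
    """
    spans, start = [], 0
    for t in range(1, len(entropies)):
        below_threshold = entropies[t] < tau_g            # global constraint
        decreasing = entropies[t] - entropies[t - 1] < 0  # monotonic constraint
        if not (below_threshold or decreasing):
            spans.append((start, t))  # close the current span [start, t)
            start = t
    spans.append((start, len(entropies)))
    return spans

# Example: a surprisal spike at index 3 opens a new span there
print(group_spans([1.2, 0.4, 0.3, 3.1, 0.5, 0.2], tau_g=1.0))
```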

This tokenisation scheme yields a vocabulary that is both compact and linguistically relevant, as spikes in the byte-level LM’s prediction error—used as segmentation cues—tend to coincide with morphemic or lexical boundaries.

2. BPB Efficiency, Compression, and Morphological Alignment

Although the term "Bits-Per-Byte" (BPB) is not always used directly, ByteSpan’s efficiency is assessed via metrics such as Rényi efficiency and fertility, which are strong proxies for compression and BPB. The BPB perspective is as follows:

  • Compression: Grouping low-information (predictable) byte spans produces longer average tokens, thus reducing the total number of tokens required to encode a sequence. This compressibility is reflected in high Rényi efficiency, which characterises how closely the token distribution approaches a maximally efficient communication channel (i.e., near-minimal BPB); see the metric sketch after this list.
  • Morphological Alignment: The grouping algorithm, tuned to information-theoretic spikes, produces tokens whose boundaries correspond more closely to morphological (especially morphemic) units, as opposed to the often arbitrary or frequency-driven merges in BPE. This morphological alignment is beneficial for tasks that profit from linguistic structure, ensuring that the compression achieved by ByteSpan does not come at the expense of linguistic relevance.
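
The following minimal sketch illustrates both metrics under simplifying assumptions: a hypothetical token-frequency dictionary stands in for corpus statistics, Rényi efficiency is computed as the order-α Rényi entropy of the token distribution normalised by log |V| (with α = 2.5, a commonly used setting, assumed here), and fertility is taken as tokens per whitespace-delimited word.

```python
import math

def renyi_efficiency(token_counts: dict[str, int], alpha: float = 2.5) -> float:
    """Order-alpha Rényi entropy of the token distribution, normalised by
    log |V|, so 1.0 means maximally uniform (efficient) vocabulary usage."""
    total = sum(token_counts.values())
    probs = [c / total for c in token_counts.values()]
    h_alpha = math.log(sum(p ** alpha for p in probs)) / (1.0 - alpha)
    return h_alpha / math.log(len(token_counts))

def fertility(num_tokens: int, num_words: int) -> float:
    """Average number of subword tokens per word; lower is better compression."""
    return num_tokens / num_words

# Toy token distribution: skewed usage lowers Rényi efficiency below 1.0
counts = {"the": 50, "un": 20, "predict": 15, "able": 15}
print(renyi_efficiency(counts), fertility(num_tokens=100, num_words=60))
```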

This suggests that the information-driven paradigm enables efficient encoding (low BPB) that is better adapted to the linguistic realities of a given language than frequency-based alternatives.

3. Multilingual Compression and Efficiency

When ByteSpan is extended to a multilingual setting covering 25 languages, its BPB efficiency, measured via Rényi efficiency and fertility, remains high and competitive with multilingual BPE.

  • Rényi efficiency: Across languages, ByteSpan maintains similar values to BPE, indicating comparable BPB from a compression viewpoint.
  • Fertility: Defined as the average number of subword tokens per word (lower values reflect better compression), ByteSpan achieves parity with BPE in most languages.

A notable phenomenon is that non-Latin or underrepresented scripts (Arabic, Chinese, Hebrew, Hindi, Japanese, Korean) may initially display higher fertility when directly grouped by the global information criterion, as the byte-level LM can be less well-calibrated for these scripts. To mitigate this, per-language token quotas—where each language receives a fixed proportion of the overall vocabulary—significantly improve compression (fertility) for these languages, though sometimes at a small trade-off in Latin-script languages.
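
A minimal sketch of such quota-based vocabulary allocation appears below. The candidates mapping, the equal per-language quota, and the leftover-filling rule are all illustrative assumptions, not ByteSpan's exact procedure.

```python
def allocate_vocab(candidates: dict[str, list[str]],
                   vocab_size: int) -> list[str]:
    """Give each language a fixed share of the vocabulary.

    `candidates` maps a language code to its spans ranked by priority
    (e.g., an information-based score); each language gets an equal quota,
    and any remaining slots are filled from the pooled leftovers.
    """
    quota = vocab_size // len(candidates)
    vocab: list[str] = []
    leftovers: list[str] = []
    for lang, spans in candidates.items():
        vocab.extend(spans[:quota])       # guaranteed per-language share
        leftovers.extend(spans[quota:])   # pooled for any spare capacity
    vocab.extend(leftovers[: vocab_size - len(vocab)])
    return vocab

# Example: a 4-slot vocabulary shared equally between two languages
cands = {"en": ["the", "ing", "er"], "ar": ["ال", "ون", "ات"]}
print(allocate_vocab(cands, vocab_size=4))
```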

This multilingual robustness confirms that ByteSpan’s method is not only effective for individual languages but also scalable to multilingual scenarios requiring equitable vocabulary allocation.

4. Comparison with Frequency-Driven Schemes (e.g., BPE)

ByteSpan stands in explicit methodological contrast to BPE, which merges the most frequent adjacent pairs without regard to the actual information content or predictability. The key distinctions are as follows:

| Feature | BPE (Frequency-driven) | ByteSpan (Information-driven) |
|---|---|---|
| Merging criterion | Frequency of bigrams | Entropy or surprisal (predictability) |
| Morphological alignment | Often arbitrary | Improved via information peaks |
| Compression/BPB | Good, but not linguistically informed | Comparable, with better linguistic quality |
| Vocabulary balancing (multilingual) | Via frequency mass | Via language-specific quotas guided by information |

The information-driven approach ensures that predictable bytes are grouped, creating subwords that are both efficient for compression (BPB) and linguistically meaningful.
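
For contrast, the core of a frequency-driven scheme is a simple pair count and merge. The sketch below shows one illustrative BPE-style iteration; it demonstrates why merges track raw pair frequency rather than predictability.

```python
from collections import Counter

def bpe_merge_step(seq: list[str]) -> list[str]:
    """One BPE iteration: find the most frequent adjacent pair and merge
    every occurrence, regardless of the pair's information content."""
    pairs = Counter(zip(seq, seq[1:]))
    if not pairs:
        return seq
    (a, b), _ = pairs.most_common(1)[0]
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
            merged.append(a + b)  # merge the winning pair
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged

# "a"+"a" is the single most frequent adjacent pair here, so it merges first
print(bpe_merge_step(list("aabaabaa")))
```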

5. Constraints, Trade-offs, and Adaptation

The entropy-based grouping uses global and monotonic constraints, but hybrid or patched constraints may better accommodate noisy or variable LM signals, especially in multilingual or under-resourced settings. The observed trade-off is that Latin-script languages may experience a marginal loss in compression performance when quotas preferentially support rarer scripts; however, this adjustment yields substantial gains for those scripts.

This suggests that practical deployment of ByteSpan-like tokenisers should consider dynamic or adaptive balancing mechanisms to ensure uniform BPB efficiency and linguistic alignment across languages.

6. Implications and Broader Impact

ByteSpan demonstrates that information-theoretic signals from pretrained byte-level LMs can guide subword construction to produce vocabularies that are simultaneously efficient (yielding favorable BPB metrics) and morphologically aligned. This dual optimisation is particularly compelling in multilingual contexts, where one must balance compression with equitable and linguistically appropriate vocabulary coverage.

The information-driven subword construction provides a path toward tokenisation schemes that are agnostic to superficial frequency statistics and instead directly exploit the structural predictability inherent in human language and script. This orientation may inform future research on language-agnostic or cross-lingual sequence modeling and shed light on connections between predictive coding, entropy minimisation, and efficient representation in both natural and artificial languages.