Bits Per Character (BPC)
- Bits Per Character (BPC) is a metric that measures the average bits required per character to encode text, offering a clear gauge of model predictive power.
- It is computed from the cross-entropy between actual data and predicted probabilities, ensuring comparisons are independent of tokenization choices.
- Empirical benchmarks reveal that lower BPC scores correlate with improved model architectures and compression capabilities in both natural language and code domains.
Bits Per Character (BPC) is a foundational metric for character-level language modeling, measuring the average number of bits required to encode each character in text under a predictive model. BPC is derived from cross-entropy between the data distribution and a model’s predicted distribution, providing an information-theoretic lens on model performance and optimal compression. It has become the standard for evaluating sequential models—RNNs, LSTMs, transformers, and code LLMs—on text and code corpora, enabling principled, tokenization-invariant comparisons and benchmarking against both human predictions and algorithmic compressors.
1. Formal Definition and Mathematical Foundations
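Let $x_{1:T} = (x_1, \dots, x_T)$ denote a sequence of characters from an alphabet $\mathcal{V}$ with $|\mathcal{V}|$ symbols. If a model assigns a predictive distribution $p(x_t \mid x_{<t})$ to each $x_t$, the average cross-entropy (in nats) is

$$\mathcal{L}_{\text{nats}} = -\frac{1}{T} \sum_{t=1}^{T} \ln p(x_t \mid x_{<t}),$$

where the logarithm is the natural log. To express this average information in bits rather than nats, divide by $\ln 2$:

$$\mathrm{BPC} = \frac{\mathcal{L}_{\text{nats}}}{\ln 2}.$$

Equivalently, using log-base-2:

$$\mathrm{BPC} = -\frac{1}{T} \sum_{t=1}^{T} \log_2 p(x_t \mid x_{<t}).$$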
In information-theoretic terms, BPC estimates the Shannon entropy $H$ of the true data distribution, which is the average code length per character under optimal encoding. Because it is a cross-entropy, a model's expected BPC can match $H$ only when its predictions equal the true distribution and is otherwise larger. For model evaluation, lower BPC signifies higher predictive power and lower redundancy in the representation (Al-Rfou et al., 2018, Dangovski et al., 2017, Lavreniuk et al., 30 Apr 2026, Rocki, 2016, Blevins et al., 2019, Mujika et al., 2017, Nouri, 26 Feb 2026, Luo et al., 16 May 2025).
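As a minimal sketch of the definition above, the following Python computes BPC from the probabilities a model assigned to the characters that actually occurred. The function names and the probability values are hypothetical, chosen only so the arithmetic is easy to follow.

```python
import math

def bits_per_character(probs):
    """Average of -log2 p over the probabilities assigned to the observed characters."""
    if not probs:
        raise ValueError("need at least one character")
    return -sum(math.log2(p) for p in probs) / len(probs)

def bpc_from_nats(avg_nll_nats):
    """Convert an average per-character cross-entropy in nats to bits."""
    return avg_nll_nats / math.log(2)

# Hypothetical probabilities assigned to four observed characters.
probs = [0.5, 0.25, 0.25, 0.5]

# Code lengths are 1, 2, 2 and 1 bits, so the average is 6 / 4 = 1.5 bits.
bpc = bits_per_character(probs)

# The same quantity obtained from the natural-log form, then divided by ln 2.
avg_nats = -sum(math.log(p) for p in probs) / len(probs)
bpc_via_nats = bpc_from_nats(avg_nats)
```

Both routes give the same value up to floating-point error, which is the equivalence between the natural-log and log-base-2 forms.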
2. Computation and Protocols for BPC Measurement
BPC computation is standardized across most works:
- The model predicts, at every time step, the conditional distribution over next characters.
- On an unseen test sequence, the log-probabilities assigned to each actual next character are recorded and summed.
- The negated sum is divided by the sequence length in characters and, if natural logarithms were used, by $\ln 2$, yielding the average BPC.
The process applies to various corpora: text8 and enwik8/9/10 for English Wikipedia, character-level Penn Treebank (PTB), and task-specific datasets for code and morphology (Al-Rfou et al., 2018, Rocki, 2016, Mujika et al., 2017, Nouri, 26 Feb 2026). For large-scale code, a sliding-window protocol is used to handle models with limited context windows: the validation corpus is scanned in overlapping windows; log-probabilities are measured only on non-overlapping segments to avoid bias in BPC when context sizes differ (Luo et al., 16 May 2025).
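The sliding-window idea can be sketched as follows. This is an illustrative reading of the protocol, not the cited authors' implementation: `score_fn`, `window`, and `stride` are hypothetical names, and the toy scorer stands in for a real model. Each window reuses earlier characters as context, but every character is scored exactly once.

```python
import math

def sliding_window_bpc(text, score_fn, window, stride):
    """BPC where each character is scored once, with up to window - stride characters of context.

    score_fn(context, segment) must return one natural-log probability per
    character of segment, conditioned on context and the preceding segment characters.
    """
    if not text:
        raise ValueError("text must be non-empty")
    if not 0 < stride <= window:
        raise ValueError("need 0 < stride <= window")
    total_nll_nats = 0.0
    pos = 0
    while pos < len(text):
        end = min(pos + stride, len(text))
        ctx_start = max(0, end - window)           # window = context + scored segment
        context, segment = text[ctx_start:pos], text[pos:end]
        total_nll_nats -= sum(score_fn(context, segment))
        pos = end                                  # scored segments never overlap
    return total_nll_nats / (len(text) * math.log(2))

scored_segments = []

def uniform_scorer(context, segment):
    """Toy stand-in for a model: uniform over a 4-symbol alphabet."""
    scored_segments.append(segment)
    return [math.log(0.25)] * len(segment)

text = "abcdabcdab"
bpc = sliding_window_bpc(text, uniform_scorer, window=8, stride=4)
```

With the uniform toy scorer the result is 2 bits per character regardless of window settings. With a real model, a larger `window` relative to `stride` gives each scored character more context, which is why the settings should be held fixed when comparing models. The first segment always has less context than later ones.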
For subword or byte-level tokenizers (as in BPE or variant schemes), BPC is computed by multiplying the average token-level negative log-likelihood (in nats) by the average tokens-per-character (TPC) and then dividing by $\ln 2$ to convert to bits:

$$\mathrm{BPC} = \frac{\mathrm{NLL}_{\text{token}} \cdot \mathrm{TPC}}{\ln 2}$$
This procedure ensures that BPC is a tokenizer-invariant metric, allowing fair comparisons independent of vocabulary choice (Nouri, 26 Feb 2026).
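A short sketch of this token-to-character conversion, assuming the evaluation reports a total token-level NLL in nats together with token and character counts. The function name and the numbers are hypothetical.

```python
import math

def bpc_from_token_nll(total_token_nll_nats, num_tokens, num_chars):
    """Convert a token-level NLL (nats) into bits per character."""
    nll_per_token = total_token_nll_nats / num_tokens   # average nats per token
    tokens_per_char = num_tokens / num_chars            # TPC
    return nll_per_token * tokens_per_char / math.log(2)

# Hypothetical evaluation: 1000 characters split into 250 tokens,
# with an average cost of 4 bits (4 * ln 2 nats) per token.
num_chars, num_tokens = 1000, 250
total_nll = num_tokens * 4 * math.log(2)

bpc = bpc_from_token_nll(total_nll, num_tokens, num_chars)   # 4 bits/token * 0.25 tokens/char

# The token count cancels: this equals total NLL / (characters * ln 2).
bpc_direct = total_nll / (num_chars * math.log(2))
```

Because the token count cancels, only the total likelihood of the text and its character length affect the result.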
3. Benchmark Results and Empirical Findings
BPC is a standard axis for reporting language modeling advances. Representative results include:
| Model | Dataset | BPC | Reference |
|---|---|---|---|
| 64-layer Transformer | text8 | 1.13 | (Al-Rfou et al., 2018) |
| 64-layer Transformer | enwik8 | 1.06 | (Al-Rfou et al., 2018) |
| FS-LSTM-2 | PTB | 1.190 | (Mujika et al., 2017) |
| FS-LSTM-4 (ensemble) | enwik8 | 1.198 | (Mujika et al., 2017) |
| FS-RUM-2 | PTB | 1.189 | (Dangovski et al., 2017) |
| Array-LSTM (stochastic) | enwik8 | 1.402 | (Rocki, 2016) |
| mLSTM + dyn. eval | text8 | 1.19 | (Al-Rfou et al., 2018) |
| cmix9 (compressor) | enwik8 | 1.25 | (Rocki, 2016) |
| LLM (Gemma-3, 27B) | Ukrainian | 0.717 | (Lavreniuk et al., 30 Apr 2026) |
State-of-the-art models—transformers leveraging deep self-attention, recurrent architectures with multiscale or rotational memory, and multi-cell LSTM variants—consistently advance empirical BPC results, especially on benchmarks emphasizing long-range dependency modeling (Al-Rfou et al., 2018, Dangovski et al., 2017, Mujika et al., 2017, Rocki, 2016). For human languages, LLMs often outperform the Shannon entropy upper bounds established by human-prediction protocols (Lavreniuk et al., 30 Apr 2026).
4. Interpretation, Theory, and Information-Theoretic Context
BPC is not just a model-specific metric; it is closely related to fundamental information-theoretic limits:
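- For an alphabet of size $|\mathcal{V}|$, the uniform-entropy upper bound is $\log_2 |\mathcal{V}|$ bits per character.
- Natural language BPC is much lower due to redundancy; e.g., human-prediction experiments bound the entropy of English well below this uniform limit (Lavreniuk et al., 30 Apr 2026).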
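To make the gap between the uniform bound and observed BPC concrete, the sketch below computes the bound for two alphabet sizes and the implied redundancy. The helper names are hypothetical. The 27-symbol alphabet corresponds to text8 (lowercase letters plus space), and 1.13 is the text8 BPC quoted in the results table.

```python
import math

def uniform_bound_bits(alphabet_size):
    """Bits per character needed if every symbol were equally likely."""
    return math.log2(alphabet_size)

def redundancy(bpc, alphabet_size):
    """Fraction of the uniform code length that the model shows to be unnecessary."""
    return 1.0 - bpc / uniform_bound_bits(alphabet_size)

byte_bound = uniform_bound_bits(256)    # raw bytes: 8 bits per character
text8_bound = uniform_bound_bits(27)    # a-z plus space: a little under 4.76 bits

# Redundancy implied by a model reaching 1.13 BPC on a 27-symbol alphabet.
text8_redundancy = redundancy(1.13, 27)
```

Under these assumptions roughly three quarters of the uniform code length is redundant, which illustrates how far natural text sits below the uniform bound.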
Shannon's original experiments estimated language entropy via next-character human prediction, providing upper and lower bounds for $H$ (Lavreniuk et al., 30 Apr 2026). In the Ukrainian study, the entropy estimate in bits/char closely matches English and Hebrew results and aligns with the empirical BPC reached by contemporary LLMs.
For compression, by Shannon’s source coding theorem, BPC directly quantifies the theoretically optimal code length per character. In practice, the best compressors engineer algorithms to approach these bounds; neural LLMs evaluated under BPC both measure and approach optimal compression (Rocki, 2016, Mujika et al., 2017).
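The link between BPC and compressed size can be illustrated with a back-of-the-envelope calculation. This sketch assumes an ideal entropy coder driven by the model and ignores coder overhead and the cost of storing the model itself. The function names and the one-million-character file are hypothetical.

```python
import math

def ideal_compressed_bytes(num_chars, bpc):
    """Approximate output size if each character costs bpc bits on average."""
    return math.ceil(num_chars * bpc / 8)

def compression_ratio(bpc, bits_per_raw_char=8):
    """Raw size divided by ideal compressed size, for fixed-width raw characters."""
    return bits_per_raw_char / bpc

# Hypothetical one-million-character file, one byte per character, coded at 1.06 BPC.
size_bytes = ideal_compressed_bytes(1_000_000, 1.06)   # about 132,500 bytes
ratio = compression_ratio(1.06)                        # about 7.5x
```

Practical compressors land somewhat above this figure, since the estimate leaves out every fixed cost.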
In code modeling, BPC exhibits a near-logarithmic relationship with multi-task "code intelligence" scores: lower BPC correlates with higher aggregate code capability, though the relationship is not strictly linear as hypothesized in earlier work (Luo et al., 16 May 2025).
5. Architectural Factors and BPC Reduction Techniques
Key architectural advances have driven BPC lower:
- Deep self-attention (transformers): Stacking many transformer layers with learned per-layer positional embeddings and carefully designed auxiliary losses enables direct conditioning on long contexts, overcoming truncated BPTT limitations in RNNs/LSTMs (Al-Rfou et al., 2018).
- Multiscale recurrence: Fast-Slow RNNs, which explicitly separate memory across timescales, retain long-term patterns and adapt quickly to local input, reducing BPC compared to traditional stacked or sequential RNNs (Mujika et al., 2017, Dangovski et al., 2017).
- Array and rotational memories: Array-LSTMs and Rotational Units of Memory increase parallel storage and enable unitary (orthogonal) transitions, preserving gradient flow and stabilizing long-sequence training, thus yielding lower BPC (Rocki, 2016, Dangovski et al., 2017).
- Morphological multitasking: Augmenting character-level models with supervision for morphological tags systematically reduces BPC across diverse languages, with inflected forms benefiting most (Blevins et al., 2019).
- BPE and statistical tokenization: BPC allows quantitative evaluation of tokenizer design choices. Statistically motivated merge criteria (e.g., Significance-Gain BPE) achieve marginal but consistent BPC improvements over purely frequency-based merges, especially as vocabulary sizes vary (Nouri, 26 Feb 2026).
Auxiliary losses (multiple-position and intermediate-layer predictions), deep stacking, and regularization are shown to be critical for convergence at high depths and further drive BPC down in deep causal transformers (Al-Rfou et al., 2018).
6. Benchmarking, Tokenizer-Invariance, and Cross-Domain Applicability
BPC is inherently invariant to tokenization when properly computed, enabling fair comparison between models using different vocabularies or segmentation schemes (Nouri, 26 Feb 2026):
- Per-token metrics (perplexity, NLL) can vary arbitrarily with the choice of tokens.
- BPC, rooted in raw character length, is stable across such changes, which is essential for evaluating models spanning byte, character, or subword domains.
- In code, language, and multi-domain modeling, BPC is the preferred metric for reliable benchmarking (Luo et al., 16 May 2025, Nouri, 26 Feb 2026).
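The contrast between per-token and per-character metrics can be sketched numerically. The example makes the purely illustrative assumption that two models with different tokenizers assign the same total likelihood to the same 1000-character text. The token counts and function names are hypothetical.

```python
import math

def per_token_perplexity(total_nll_nats, num_tokens):
    """Exponentiated average NLL per token; depends on how the text is segmented."""
    return math.exp(total_nll_nats / num_tokens)

def bpc(total_nll_nats, num_chars):
    """Bits per character; depends only on total likelihood and text length."""
    return total_nll_nats / (num_chars * math.log(2))

num_chars = 1000
total_nll = 1200 * math.log(2)    # 1200 bits in total, expressed in nats

# Tokenizer A: character-level (1000 tokens). Tokenizer B: subwords (300 tokens).
ppl_a = per_token_perplexity(total_nll, 1000)   # 2 ** 1.2, about 2.3
ppl_b = per_token_perplexity(total_nll, 300)    # 2 ** 4, i.e. 16

bpc_a = bpc(total_nll, num_chars)
bpc_b = bpc(total_nll, num_chars)               # identical: token count never enters
```

The two perplexities differ by a large factor even though both models explain the text equally well, while BPC is the same for both.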
Empirical values track both the inherent entropy of the data and modeling power; thus, BPC summarizes performance in contexts ranging from natural language to code corpora, and can be directly compared to algorithmic compressors and human prediction experiments.
7. Limitations, Practical Implications, and Connections to Intelligence
While BPC tightly quantifies next-character prediction performance and compression efficiency, its scope has practical and theoretical limits:
- It does not directly measure semantic coherence, reasoning, or higher-level structure—only local predictive accuracy (Luo et al., 16 May 2025).
- Some redundancy in BPC remains even in state-of-the-art models, indicating deeper structure is still not fully captured (Rocki, 2016).
- Benchmarking against Shannon entropy estimates underscores that substantial compression improvements may still be possible; LLMs sometimes even surpass human-predictive BPC but may do so via overfitting to distributional idiosyncrasies rather than true linguistic competence (Lavreniuk et al., 30 Apr 2026).
- In code, the empirical connection between BPC and "code intelligence" is strong but nonlinear. Lower BPC does not guarantee human-level reasoning but correlates with improved aggregate benchmark scores when measured across multi-task, multi-language suites (Luo et al., 16 May 2025).
A plausible implication is that further BPC gains may require architectural or objective innovations that move beyond shallow local dependencies, toward deeper linguistic or semantic modeling.
In summary, Bits Per Character is a mathematically rigorous, empirically robust, and widely adopted metric for character-level modeling. Its principled construction allows detailed benchmarking across models, domains, and languages, while its limits highlight ongoing challenges in capturing the true complexity of human language and code (Al-Rfou et al., 2018, Dangovski et al., 2017, Rocki, 2016, Mujika et al., 2017, Nouri, 26 Feb 2026, Blevins et al., 2019, Luo et al., 16 May 2025, Lavreniuk et al., 30 Apr 2026).