Pretrained Battery Transformer (PBT)
- Pretrained Battery Transformer (PBT) is a foundation model that uses domain-knowledge-encoded MoE layers and multi-level Transformers to predict lithium-ion battery cycle life.
- It achieves robust performance with a test MAPE of 14.3% by integrating both intra-cycle and inter-cycle encoding techniques alongside soft and hard expert routing.
- PBT enables efficient transfer learning via full fine-tuning and adapter strategies, making it effective in both data-rich and data-scarce environments.
The Pretrained Battery Transformer (PBT) is a large-scale foundation model designed for early prediction of lithium-ion battery (LIB) cycle life. Developed using domain-knowledge-encoded mixture-of-experts (MoE) architectures and trained on the largest public collection of battery life datasets, PBT enables generalized, transferable, and state-of-the-art battery lifetime forecasting. Its architectural innovations and transfer learning paradigm address core challenges in battery data heterogeneity and scarcity, setting a new direction for universal battery lifetime prediction (Tan et al., 18 Dec 2025).
1. Model Architecture
PBT implements a multi-level Transformer system, leveraging a domain-knowledge-rich Mixture-of-Expert pipeline ("BatteryMoE") at both intra-cycle and inter-cycle scales. The architecture comprises the following components:
- Input Encoding: Each battery's early life (its first $k$ charge–discharge cycles) is represented as a time series, where each cycle is resampled to 300 points (voltage, current, capacity), yielding a 300×3 array per cycle.
- BatteryMoE-CyclePatch Layer: Each cycle is tokenized into a $d$-dimensional embedding through a linear projection followed by a domain-knowledge-aware MoE layer: $z_c = \mathrm{BatteryMoE}(\mathrm{Linear}(x_c))$, where $x_c$ is the resampled cycle array.
- Intra-Cycle Encoder: Stacked ($N_1$) BatteryMoE feed-forward layers with residual connections extract per-cycle features, where each layer computes $h^{(l+1)} = h^{(l)} + \mathrm{BatteryMoE}^{(l)}\big(h^{(l)}\big)$.
- Inter-Cycle Transformer Encoder: Stacked ($N_2$) Transformer encoder layers with self-attention and BatteryMoE feed-forward networks operate on the sequence of cycle embeddings, with positional encoding added to the cycle tokens.
- Projection Head: A mixture of linear heads aggregates the final cycle-token embedding $z$ into a scalar life prediction $\hat{y} = \sum_k \alpha_k \big(w_k^\top z + b_k\big)$, where $\alpha_k$ are head-gating weights.
Key architectural dimensions: embedding width $d$; stack depths $N_1$ (intra-cycle) and $N_2$ (inter-cycle); 8 attention heads per Transformer layer (Tan et al., 18 Dec 2025).
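The multi-level layout can be summarized in a short PyTorch sketch. This is a minimal illustration, not the released implementation: the hidden width, layer counts, maximum sequence length, and the `BatteryMoEStub` placeholder (a plain feed-forward block standing in for the gated MoE of Section 2) are all assumptions, and the single linear head simplifies the paper's mixture-of-linear-heads.

```python
# Minimal sketch of the multi-level PBT layout; dimensions are assumptions,
# not the paper's reported hyperparameters.
import torch
import torch.nn as nn

class BatteryMoEStub(nn.Module):
    """Placeholder for the domain-knowledge-encoded MoE feed-forward block
    (see Section 2 for a gating sketch)."""
    def __init__(self, d):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x):
        return self.ff(x)

class PBTSketch(nn.Module):
    def __init__(self, d=128, n_intra=2, n_inter=4, n_heads=8,
                 points=300, channels=3, max_cycles=512):
        super().__init__()
        # CyclePatch: each cycle's (300, 3) array -> one d-dim token.
        self.patch = nn.Linear(points * channels, d)
        self.patch_moe = BatteryMoEStub(d)
        # Intra-cycle encoder: stacked MoE feed-forward layers with residuals.
        self.intra = nn.ModuleList(BatteryMoEStub(d) for _ in range(n_intra))
        # Inter-cycle encoder: Transformer layers over the cycle-token sequence.
        self.inter = nn.ModuleList(
            nn.TransformerEncoderLayer(d, n_heads, batch_first=True)
            for _ in range(n_inter)
        )
        self.pos = nn.Parameter(torch.zeros(1, max_cycles, d))  # learned positions
        self.head = nn.Linear(d, 1)  # single head simplifies the mixture of heads

    def forward(self, cycles):                    # cycles: (batch, n_cycles, 300, 3)
        b, n, p, c = cycles.shape
        tok = self.patch_moe(self.patch(cycles.reshape(b, n, p * c)))
        for layer in self.intra:
            tok = tok + layer(tok)                # residual intra-cycle refinement
        tok = tok + self.pos[:, :n]               # assumes n <= max_cycles
        for layer in self.inter:
            tok = layer(tok)
        return self.head(tok[:, -1]).squeeze(-1)  # regress from final cycle token
```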
2. Domain-Knowledge-Encoded Mixture-of-Experts
The BatteryMoE mechanism allows PBT to combine general and highly specialized expertise during representation learning:
- Expert Routing: Each BatteryMoE layer contains $E$ expert networks. Routing logits are computed using both a soft encoder and a hard encoder.
- Soft Encoder: An LLM-based module generates an embedding from textual prompts encoding up to ten known aging factors (e.g., specifications, formation protocol, operation conditions), which is then passed through an MLP and mapped to gating logits.
- Hard Encoder: Discrete and continuous battery metadata (cathode, anode, format, temperature within ±5 °C) are used to mask gating logits, so only compatible experts contribute.
For each layer input $x$, the layer output is

$$\mathrm{BatteryMoE}(x) = \sum_{i=1}^{E} g_i \, E_i(x),$$

with gating weights $g_i$ set to zero for non-matching experts (Tan et al., 18 Dec 2025).
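A minimal sketch of this soft-plus-hard routing, assuming a precomputed LLM prompt embedding and a boolean compatibility mask from the hard encoder; all names, shapes, and the expert count are illustrative rather than the paper's code:

```python
import torch
import torch.nn as nn

class BatteryMoEGate(nn.Module):
    """Soft + hard expert routing sketch: an MLP maps the LLM prompt embedding
    to gating logits; the hard mask zeroes out incompatible experts."""
    def __init__(self, d, n_experts=8, prompt_dim=768):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
            for _ in range(n_experts)
        )
        self.soft_gate = nn.Sequential(                 # soft encoder -> logits
            nn.Linear(prompt_dim, d), nn.GELU(), nn.Linear(d, n_experts)
        )

    def forward(self, x, prompt_emb, hard_mask):
        # x: (batch, seq, d); prompt_emb: (batch, prompt_dim);
        # hard_mask: (batch, n_experts) bool, True = expert compatible.
        # Assumes at least one compatible expert per battery.
        logits = self.soft_gate(prompt_emb)
        logits = logits.masked_fill(~hard_mask, float("-inf"))
        g = torch.softmax(logits, dim=-1)               # masked experts get weight 0
        out = torch.stack([e(x) for e in self.experts], dim=-1)  # (b, s, d, E)
        return (out * g[:, None, None, :]).sum(-1)      # gated expert mixture
```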
Ablation studies confirm that excluding hard or soft encoders leads to a 10–15% increase in mean absolute percentage error (MAPE). Substituting BatteryMoE with generic MoE degrades performance by ~8%. Incorporation of BatteryMoE compensates for sparse data, offering benefits equivalent to a 33–50% increase in training data (Tan et al., 18 Dec 2025).
3. Pretraining Data and Workflow
PBT was pretrained using 13 public LIB databases (837 batteries, 426 unique aging protocols), spanning significant diversity in chemistry (e.g., LFP, LCO, NCM111, NCA), format (cylindrical, pouch, prismatic), and cycle-life range (102–4999 cycles):
| Dataset Examples | Cell Chemistries | Application Scope |
|---|---|---|
| CALCE, MATR, RWTH, HNEI, … | LFP, NCM, NCA, LCO, … | Research, industry, various |
Each dataset is split 60/20/20% for training/validation/testing. The core objective is early prediction of cycle life given the first $k$ cycles' data. Optimization uses AdamW (batch size 128, dropout 0.05) with fixed learning-rate and weight-decay settings. Training comprises ≈50,000 steps until validation loss stabilizes. The loss minimized is the mean squared error

$$\mathcal{L} = \frac{1}{B} \sum_{j=1}^{B} \big(\hat{y}_j - y_j\big)^2 .$$
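A hedged sketch of one pretraining step under these settings, reusing the `PBTSketch` module from Section 1; the learning-rate and weight-decay values below are placeholders, not the paper's:

```python
import torch
import torch.nn.functional as F

model = PBTSketch()  # from the Section 1 sketch
opt = torch.optim.AdamW(model.parameters(),
                        lr=1e-4, weight_decay=1e-2)  # placeholder values

def train_step(cycles, life):
    # cycles: (128, n_cycles, 300, 3); life: (128,) cycle-life labels
    pred = model(cycles)
    loss = F.mse_loss(pred, life)    # MSE objective stated above
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

@torch.no_grad()
def mape(pred, target):
    # Mean absolute percentage error, the reported evaluation metric.
    return (pred - target).abs().div(target).mean().item()
```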
On held-out test sets, PBT converges to a test MAPE of 14.3%, outperforming CPTransformer at 17.8% (a 19.8% mean relative improvement). Both in-distribution ("seen") and out-of-distribution ("unseen") splits favor PBT across all datasets; single-cycle input scenarios also show significant gains (PBT: 16.5% MAPE vs. baseline 24.0%) (Tan et al., 18 Dec 2025).
4. Transfer Learning and Adapter Strategies
PBT supports parameter-efficient transfer to novel chemistries, protocols, or field data:
- Full Fine-Tuning: All parameters may be updated when sufficient labeled target data (as few as 7–20 cells) is available. Batch size and learning rate are tuned per dataset.
- Adapter Tuning: Adapter modules, inserted after the layernorm in the first $L_a$ layers, are trained while the Transformer backbone is frozen. Each adapter applies a residual bottleneck transformation, $h \mapsto h + W_{\text{up}}\,\sigma(W_{\text{down}}\, h)$; only the adapter parameters are updated, a small fraction of the total parameter count (see the sketch after this list).
- Domain Fit: No explicit adversarial domain adaptation is employed; instead, inductive biases from soft/hard encoders guide adaptation.
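A minimal sketch of such an adapter, assuming the common bottleneck parameterization with a zero-initialized up-projection (so tuning starts from the pretrained behavior); the bottleneck width is illustrative:

```python
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter placed after layernorm; only these parameters
    are trained while the PBT backbone stays frozen."""
    def __init__(self, d, bottleneck=16):
        super().__init__()
        self.down = nn.Linear(d, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, d)
        nn.init.zeros_(self.up.weight)   # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, h):
        # Residual connection preserves the pretrained representation.
        return h + self.up(self.act(self.down(h)))

# Freeze the backbone, train adapters only:
# for p in model.parameters(): p.requires_grad = False
# for p in adapter.parameters(): p.requires_grad = True
```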
PBT demonstrates rapid convergence and strong performance in data-limited regimes, achieving low MAPE even on target datasets of moderate size and diversity (Tan et al., 18 Dec 2025).
5. Evaluation and Benchmarking
PBT establishes new benchmarks for both in-distribution and out-of-distribution battery lifetime prediction:
| Task | PBT MAPE | Baseline (CPTransformer) | Relative Improvement |
|---|---|---|---|
| Overall | 0.143 | 0.178 | 19.8% |
| Seen Conditions | 0.138 | 0.174 | 20.8% |
| Unseen Conditions | 0.148 | 0.182 | 18.7% |
| Single-Cycle | 0.165 | 0.240 | 31.3% |
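Relative improvement is computed against the baseline's MAPE; for example, for the single-cycle row:

$$\text{Rel. improvement} = \frac{\mathrm{MAPE}_{\text{base}} - \mathrm{MAPE}_{\text{PBT}}}{\mathrm{MAPE}_{\text{base}}} = \frac{0.240 - 0.165}{0.240} \approx 31.3\%.$$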
In transfer learning on 12 target datasets, PBT-TL reduces MAPE by 22.0% on average. Maximum observed gains (86.9% MAPE reduction) occur in highly heterogeneous datasets with very limited training examples per aging condition. On data outside pretraining coverage (industrial Li-ion, Na-ion, Zn-ion), fine-tuned PBT achieves 27.2%, 11.5%, and 17.2% relative MAPE reductions, respectively (Tan et al., 18 Dec 2025).
Removing expert gating modules substantially degrades generalization. The domain-adaptive BatteryMoE framework is responsible for much of the observed robustness—particularly in "zero-shot" and cross-condition forecasting.
6. Deployment Guidelines and Limitations
- Dataset Alignment: New datasets must be resampled to 300 (voltage, current, capacity) points per cycle (see the resampling sketch after this list). Target batteries should fall within the specification space encoded in the pretrained model; otherwise, adapter tuning is recommended.
- Sample Efficiency: Accurate fine-tuning is possible with as few as 7–20 cells.
- Best Practices: Use early stopping based on validation MAPE and tune regularization parameters for optimal performance.
- Limitations: The scarcity of certain aging factors (e.g., manufacturing process variables) in public datasets limits extrapolation to some proprietary cells. BatteryMoE modules can be extended to capture additional factors, and parameter-efficient adaptation (e.g., LoRA, prompt tuning) is proposed as a future enhancement.
- Future Directions: Scaling to larger or more diverse battery archives and integrating unified lifetime metrics (e.g., field usage) are open research areas (Tan et al., 18 Dec 2025).
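As a concrete illustration of the dataset-alignment step, the following sketch resamples one cycle onto the fixed 300-point grid; linear interpolation over time is an assumption, as the paper's exact resampling scheme is not specified here:

```python
import numpy as np

def resample_cycle(time, voltage, current, capacity, n_points=300):
    """Resample one charge-discharge cycle onto a fixed 300-point grid so it
    matches PBT's expected (300, 3) per-cycle input. Assumes `time` is
    monotonically increasing; linear interpolation is illustrative."""
    grid = np.linspace(time[0], time[-1], n_points)
    return np.stack(
        [np.interp(grid, time, ch) for ch in (voltage, current, capacity)],
        axis=-1,
    )  # shape: (n_points, 3)
```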
7. Comparison and Distinction from Related Models
PBT is the first foundation model specifically architected for battery life prediction using domain-knowledge-encoded mixture-of-expert layers validated on the largest benchmark suite. Unlike models such as Battery-Timer (a fine-tuned Timer variant for capacity degradation autoregression) and conventional expert models refined through knowledge distillation (Chan et al., 13 May 2025), PBT combines soft and hard factor-driven gating, Transformer representations, and direct cycle life regression in a transfer-oriented framework. Ablation analyses demonstrate its superiority in both data-rich and data-scarce situations, establishing a universal pathway for robust, accurate, early battery lifetime assessment (Tan et al., 18 Dec 2025).