ProtactiniumBERT: Transparent Clinical NLP
- ProtactiniumBERT is a BERT-base encoder for clinical NLP that performs PHI de-identification and ICD-9-CM code extraction, trained entirely on synthetic data.
- A dual-head architecture (token classification plus multi-label classification) yields performance above ClinicalBERT baselines on both tasks.
- The accompanying TeMLM transparency suite documents full model provenance, supports repeatable benchmarking, and aligns with strict data-governance requirements.
ProtactiniumBERT is a BERT-base–style encoder designed for high-transparency clinical natural language processing, tailored to synthetic healthcare documentation for two key tasks: protected health information (PHI) de-identification and ICD-9-CM diagnosis code extraction. Featuring approximately 100 million parameters, this model serves as a reproducible and auditable reference point for privacy-preserving medical NLP under data-governance regimes that restrict access to real clinical text. ProtactiniumBERT’s release artifacts, aligned with the TeMLM transparency suite, exemplify comprehensive data and model provenance for repeatable benchmarking and deployment readiness audits (Imanov et al., 27 Jan 2026).
1. Model Architecture and Tokenization
ProtactiniumBERT adopts a canonical BERT-base Transformer encoder architecture, consisting of 12 Transformer layers, each endowed with 12 self-attention heads and a hidden size of 768. Input sequences are tokenized using a 30,000-piece uncased WordPiece vocabulary. Pretraining supports sequence lengths up to 512 tokens, with longer clinical notes processed via a sliding-window extraction scheme. At fine-tuning, two output “heads” are appended:
- Token-classification head: BIO-labeled sequence tagging for 10 PHI entity types.
- Multi-label classification head: Binary outputs for the top-50 ICD-9-CM codes (one-vs-rest setup).
This configuration maintains compatibility with popular biomedical BERT checkpoints, facilitating continual pretraining and clinical adaptation.
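The dual-head layout can be sketched as follows. This is a minimal NumPy illustration with randomly initialized stand-in weights, not the released implementation; in particular, the BIO tag count of 21 (a B- and I- tag per PHI type plus O) and the use of the [CLS] vector for pooling are assumptions, since the source does not spell them out.

```python
import numpy as np

rng = np.random.default_rng(0)

HIDDEN = 768       # BERT-base hidden size
N_BIO_TAGS = 21    # assumed: B-/I- tags for each of 10 PHI types, plus O
N_ICD_CODES = 50   # top-50 ICD-9-CM codes, one-vs-rest

# Stand-in for the encoder's final hidden states, shape (seq_len, HIDDEN).
# In the real model these come from the 12-layer Transformer stack.
seq_len = 128
hidden_states = rng.standard_normal((seq_len, HIDDEN))

# Token-classification head: one logit vector per token.
W_tok = rng.standard_normal((HIDDEN, N_BIO_TAGS)) * 0.02
token_logits = hidden_states @ W_tok                   # (seq_len, N_BIO_TAGS)

# Multi-label head: pooled [CLS] representation -> 50 independent sigmoids.
W_icd = rng.standard_normal((HIDDEN, N_ICD_CODES)) * 0.02
cls_vec = hidden_states[0]                             # assumed [CLS] pooling
icd_probs = 1.0 / (1.0 + np.exp(-(cls_vec @ W_icd)))   # (N_ICD_CODES,)

print(token_logits.shape, icd_probs.shape)
```

Because the two heads share one encoder, both tasks read the same contextual representations; only the output projections differ.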
2. Pretraining and Fine-tuning Data
The model initializes from a pre-existing biomedical checkpoint (e.g., BioBERT or ClinicalBERT) and is further pretrained using the Technetium-I corpus: a large-scale, richly annotated, fully synthetic English clinical-note dataset. Technetium-I provides millions of notes with 7.74 million PHI mentions across categories such as NAME, PROFESSION, LOCATION, AGE, DATE, CONTACT, ID, HOSPITAL, and DEVICE. The diagnosis labels are limited to the 50 most frequent ICD-9-CM codes. The standard dataset split is 70/15/15 for train/validation/test, enforced at the patient level with similarity-based leakage auditing. ProtactiniumBERT’s public reference results are entirely based on this synthetic corpus; validation on institutional EHR data remains a formal prerequisite for real-world deployment (Imanov et al., 27 Jan 2026).
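Patient-level splitting means partitioning patients, not notes, so that no patient's documents straddle two partitions. A minimal sketch under assumed names (the real pipeline also runs similarity-based leakage auditing, which is not shown here):

```python
import random

def patient_level_split(note_ids_by_patient, seed=13):
    """Split notes 70/15/15 into train/val/test at the patient level,
    so that no patient's notes appear in more than one partition."""
    patients = sorted(note_ids_by_patient)
    random.Random(seed).shuffle(patients)
    n = len(patients)
    n_train, n_val = int(0.70 * n), int(0.15 * n)
    buckets = {
        "train": patients[:n_train],
        "val": patients[n_train:n_train + n_val],
        "test": patients[n_train + n_val:],
    }
    # Expand each patient bucket back into its note IDs.
    return {split: [nid for p in pats for nid in note_ids_by_patient[p]]
            for split, pats in buckets.items()}

# Toy corpus: 10 patients, 2 notes each.
corpus = {f"pt{i}": [f"pt{i}-note{j}" for j in range(2)] for i in range(10)}
splits = patient_level_split(corpus)
assert not (set(splits["train"]) & set(splits["test"]))
```

Splitting by note instead of by patient would let near-duplicate notes from one patient leak across partitions, inflating both PHI and coding scores.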
3. Training Objectives and Optimization Procedures
Pretraining continues with dynamic masked language modeling (MLM). Downstream fine-tuning combines two supervised objectives:
- Token-classification loss (PHI de-ID): token-level cross-entropy,

$$\mathcal{L}_{\text{token}} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c}\,\log \hat{y}_{i,c},$$

where $N$ is the number of tokens, $C$ is the number of entity classes, $y_{i,c}$ are ground-truth labels, and $\hat{y}_{i,c}$ are predicted probabilities.
- Multi-label classification loss (ICD coding): binary cross-entropy,

$$\mathcal{L}_{\text{ICD}} = -\frac{1}{K}\sum_{k=1}^{K}\left[y_k \log \hat{y}_k + (1-y_k)\log\left(1-\hat{y}_k\right)\right],$$

for $K$ ICD codes. Optimization in both cases uses AdamW with learning-rate warmup and linear decay. Full hyperparameter configurations (batch size, learning rate, number of epochs) are version-controlled and logged in the TeMLM-Provenance artifact. Sliding-window processing enables handling of arbitrarily long input sequences.
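The two losses above can be checked numerically. A NumPy sketch (probabilities are assumed to come from a softmax over BIO tags and per-code sigmoids, respectively; the epsilon clipping is a standard numerical-stability guard, not something the source specifies):

```python
import numpy as np

def token_ce_loss(probs, labels):
    """Mean categorical cross-entropy over N tokens and C classes.
    probs: (N, C) predicted probabilities; labels: (N, C) one-hot."""
    return float(-np.mean(np.sum(labels * np.log(probs + 1e-12), axis=1)))

def multilabel_bce_loss(probs, labels):
    """Mean binary cross-entropy over K one-vs-rest ICD codes."""
    probs = np.clip(probs, 1e-12, 1 - 1e-12)
    return float(-np.mean(labels * np.log(probs)
                          + (1 - labels) * np.log(1 - probs)))

# Toy check: 3 tokens over 2 classes, and 4 ICD codes.
tok_probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]])
tok_labels = np.array([[1, 0], [0, 1], [1, 0]])
icd_probs = np.array([0.9, 0.1, 0.8, 0.2])
icd_labels = np.array([1, 0, 1, 0])
print(token_ce_loss(tok_probs, tok_labels),
      multilabel_bce_loss(icd_probs, icd_labels))
```

In fine-tuning frameworks these correspond to a softmax cross-entropy on the token head and a sigmoid binary cross-entropy on the multi-label head; whether ProtactiniumBERT sums or weights the two losses when training jointly is not stated in the source.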
4. Evaluation Metrics and Benchmark Performance
Evaluation adheres to micro- and macro-averaged metrics standard in clinical NLP:
| Task | Micro-F1 | Macro-F1 | Precision@5 | ClinicalBERT Baseline (Micro-F1 / Macro-F1 / P@5) |
|---|---|---|---|---|
| PHI de-identification | 0.984 | — | — | 0.976 / — / — |
| ICD-9-CM top-50 coding | 0.760 | 0.640 | 0.790 | 0.730 / 0.600 / 0.770 |
- PHI de-identification is evaluated at the token level (BIO-labeled).
- ICD-9-CM coding is assessed using micro- and macro-F1, with Precision@5 indicating the fraction of true codes among the top five predictions per note.
Formally:
- Precision $= \dfrac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}}$
- Recall $= \dfrac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}$
- F1 $= \dfrac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}$
Both micro- and macro-averaging strategies are reported, with micro-F1 summing true/false positives and negatives across all codes prior to scoring. ProtactiniumBERT consistently outperforms ClinicalBERT in both task domains under synthetic evaluation (Imanov et al., 27 Jan 2026).
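The micro/macro distinction and Precision@5 can be made concrete with a small sketch (function names and the toy data are illustrative, not taken from the evaluation code):

```python
import numpy as np

def micro_f1(y_true, y_pred):
    """Micro-F1: pool TP/FP/FN across all notes and codes before scoring."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def macro_f1(y_true, y_pred):
    """Macro-F1: compute F1 per code, then take the unweighted mean."""
    return float(np.mean([micro_f1(y_true[:, k], y_pred[:, k])
                          for k in range(y_true.shape[1])]))

def precision_at_5(scores, y_true_row):
    """Fraction of the 5 highest-scored codes that are true for one note."""
    top5 = np.argsort(scores)[::-1][:5]
    return float(np.mean(y_true_row[top5]))

# Toy example: 2 notes, 6 codes.
y_true = np.array([[1, 0, 1, 0, 0, 1], [0, 1, 0, 0, 1, 0]])
y_pred = np.array([[1, 0, 0, 0, 0, 1], [0, 1, 1, 0, 1, 0]])
scores = np.array([0.9, 0.1, 0.8, 0.2, 0.7, 0.6])  # model scores for note 1
print(micro_f1(y_true, y_pred), macro_f1(y_true, y_pred),
      precision_at_5(scores, y_true[0]))
```

Micro-F1 weights frequent codes more heavily, while macro-F1 treats every code equally, which is why the two diverge on imbalanced top-50 label distributions.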
5. Transparency and Provenance Framework
All ProtactiniumBERT releases are accompanied by a complete TeMLM transparency artifact suite:
- TeMLM-Datasheet: Documents dataset version, synthetic generation protocol, PHI annotation logic, de-identification assumptions, patient-level splits, leakage rates, and annotation reliability assessments.
- TeMLM-Card: Summarizes architectural details, intended clinical scenarios (PHI redaction, code assignment), prohibited applications, explicit provenance (dataset hashes, code commits), and limitations. Incorporates monitoring protocols (for accuracy drift and rollback).
- TeMLM-Provenance: Uses a PROV-style event graph (JSON), mapping every workflow step (data extraction, annotation, preprocessing, training, evaluation) to artifact hashes for traceability.
- Conformance Checklist: Enforces 100% data and documentation completion, audit of de-ID threat model (sampling-based verification), leakage analysis (near-duplicate curve), reproducibility metadata (random seeds, environment capture).
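A PROV-style event of the kind TeMLM-Provenance records can be sketched as one JSON node linking input artifacts to an output by content hash. All field names and payloads below are illustrative assumptions, not the actual TeMLM-Provenance schema:

```python
import hashlib
import json

def sha256_of(payload: bytes) -> str:
    """Content hash used to identify an artifact immutably."""
    return hashlib.sha256(payload).hexdigest()

# Stand-in artifact contents (hypothetical).
train_config = b'{"lr": 3e-5, "epochs": 4, "seed": 42}'
dataset_blob = b"technetium-i-v1 archive bytes"
checkpoint_blob = b"model weight bytes"

# One activity node: what was used, and what it generated.
event = {
    "activity": "fine-tune",
    "used": [
        {"artifact": "dataset", "sha256": sha256_of(dataset_blob)},
        {"artifact": "config", "sha256": sha256_of(train_config)},
    ],
    "generated": {"artifact": "checkpoint",
                  "sha256": sha256_of(checkpoint_blob)},
}
print(json.dumps(event, indent=2))
```

Chaining such events (extraction, annotation, preprocessing, training, evaluation) yields a graph in which any released checkpoint can be traced back to exact dataset and code versions by hash comparison.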
This comprehensive transparency regime ensures claims are auditable and reproducible, lowering the administrative threshold for third-party validation and independent benchmarking under constrained data-governance conditions (Imanov et al., 27 Jan 2026).
6. Limitations and Deployment Considerations
ProtactiniumBERT’s reference results are derived exclusively from synthetic clinical text; no real EHR data are employed for model selection or benchmarking. As a consequence, salient aspects such as institution-specific vocabulary, rare identifier types, and longitudinal record complexity are not fully represented. It is emphasized that validation on governed real-world EHR corpora is necessary before operational deployments. The modularity of the transparency suite provides a replicable template for institutional adaptation, audit, and continuous model governance, but does not eliminate the requirement for local data validation and oversight (Imanov et al., 27 Jan 2026).
7. Significance and Application Scope
ProtactiniumBERT functions as a transparency-first, reproducible baseline for de-identification and diagnostic coding tasks in medical NLP. Its interoperable BERT-base backbone and well-specified release bundle facilitate external audit, regulatory assessment, and multi-site studies in privacy-preserving settings. A plausible implication is that the TeMLM artifact methodology adopted for ProtactiniumBERT could inform best practices in clinical NLP, particularly where direct sharing of sensitive text is infeasible but reproducible research and independent validation remain paramount (Imanov et al., 27 Jan 2026).