MedCPT: Biomedical Coding & Retrieval
- MedCPT names two separate frameworks: an end-to-end deep learning system for procedure coding, and contrastively pre-trained Transformers for zero-shot biomedical information retrieval.
- The coding framework uses character-level ICD embeddings and a multi-label loss, reaching a Recall@3 of 0.90 on EHR claims.
- The retrieval framework trains a bi-encoder and a cross-encoder on large-scale PubMed click logs to improve semantic search and benchmark performance.
MedCPT refers to two distinct, foundational frameworks in biomedical informatics: (1) an end-to-end deep learning system for automatic coding of procedure codes (Current Procedural Terminology; CPT) from diagnosis codes (ICD-10) in Electronic Health Records (EHRs) (Haq et al., 2017), and (2) a large-scale contrastively pre-trained Transformer model for zero-shot biomedical information retrieval (IR) using PubMed search logs (Jin et al., 2023). Both systems advance the state of the art in their respective domains by leveraging supervised and self-supervised learning at scale, embedding-based architectures, and tailored loss functions for multi-label classification and semantic retrieval.
1. Automatic Procedure Coding in EHRs
The MedCPT-style system for EHRs addresses the challenge of mapping high-cardinality diagnosis code sets and contextual covariates to appropriate CPT procedure codes—a multi-label classification task crucial for billing and clinical workflow efficiency.
Problem Formulation
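- Input: Each insurance claim comprises a variable set of ICD-10 diagnosis codes, patient age, gender, and provider ID. The ICD codes are encoded as a high-dimensional sparse one-hot vector $x_{\text{ICD}} \in \{0,1\}^{V}$, where $V$ is the size of the ICD vocabulary. After embedding, this yields a dense vector in $\mathbb{R}^{d}$ with $d \ll V$.
- Output: A set of CPT codes, represented as a binary vector $y \in \{0,1\}^{C}$, or as a probability vector $\hat{y} \in [0,1]^{C}$, where $C$ is the number of CPT codes.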
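As a minimal sketch of this formulation, the snippet below encodes the ICD-10 codes of one claim as a sparse 0/1 vector and its CPT labels as a binary target. The vocabularies `ICD_VOCAB` and `CPT_VOCAB`, the helper `multi_hot`, and the example claim are hypothetical and chosen only for illustration; real vocabularies are far larger.

```python
# Toy vocabularies (hypothetical; real ICD-10 and CPT vocabularies are far larger).
ICD_VOCAB = ["E11.9", "I10", "J45.909", "M54.5"]
CPT_VOCAB = ["99213", "93000", "94010"]

def multi_hot(codes, vocab):
    """Return a 0/1 vector with a 1 at the index of every code present."""
    index = {code: i for i, code in enumerate(vocab)}
    vec = [0] * len(vocab)
    for code in codes:
        vec[index[code]] = 1
    return vec

claim_icds = ["I10", "E11.9"]      # diagnosis codes on one claim
claim_cpts = ["99213", "93000"]    # procedure codes billed for that claim

x_icd = multi_hot(claim_icds, ICD_VOCAB)   # sparse input representation
y = multi_hot(claim_cpts, CPT_VOCAB)       # binary multi-label target
```

Age, gender, and provider ID would be appended as further input features; they are left out here to keep the example short.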
2. Embedded Representations and Aggregation
Character-level ICD Embeddings
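- Each ICD-10 code is a sequence of 7 alphanumeric characters drawn from an alphabet $\mathcal{A}$ (alphabet size $|\mathcal{A}|$).
- For each position $p \in \{1, \dots, 7\}$, learn an embedding matrix $E_p \in \mathbb{R}^{|\mathcal{A}| \times d_c}$. If code $c$ has character $a_p$ at position $p$, its embedding at that position is the row $E_p[a_p]$.
- The code embedding $e(c)$ is the concatenation $[E_1[a_1]; \dots; E_7[a_7]]$ with total dimension $7 d_c$.
Claim-level Aggregation
- For claim ICDs $c_1, \dots, c_n$, the claim embedding $e_{\text{claim}}$ aggregates the code embeddings $e(c_1), \dots, e(c_n)$ into one fixed-length vector.
- Provider ID is embedded as a learned vector $e_{\text{prov}}$.
- Full input: $x = [e_{\text{claim}}; e_{\text{prov}}; \text{age}; \text{gender}]$.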
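The character-level embedding and claim-level aggregation described above can be sketched in plain Python. Several details here are assumptions rather than the authors' choices: the padding character `_` for codes shorter than 7 characters, the embedding size `CHAR_DIM`, the random initialisation (a trained model would learn these tables), and mean pooling as the aggregation over a claim's codes.

```python
import random

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_"  # '_' pads short codes (assumption)
NUM_POSITIONS = 7   # character positions per ICD-10 code
CHAR_DIM = 4        # per-character embedding size (hypothetical)

rng = random.Random(0)
# One embedding table per character position: position -> character -> vector.
char_tables = [
    {ch: [rng.uniform(-0.1, 0.1) for _ in range(CHAR_DIM)] for ch in ALPHABET}
    for _ in range(NUM_POSITIONS)
]

def embed_code(code):
    """Concatenate the position-specific embeddings of a code's characters."""
    chars = code.replace(".", "").upper().ljust(NUM_POSITIONS, "_")
    vec = []
    for pos, ch in enumerate(chars[:NUM_POSITIONS]):
        vec.extend(char_tables[pos][ch])
    return vec

def embed_claim(codes):
    """Aggregate code embeddings by mean pooling (an assumption, see above)."""
    vectors = [embed_code(c) for c in codes]
    return [sum(col) / len(vectors) for col in zip(*vectors)]

claim_vec = embed_claim(["E11.9", "I10"])
```

Because the tables are indexed by position, two codes that share a leading character (for example `E11.9` and `E10`) share the first `CHAR_DIM` entries of their embeddings, which is how related codes can end up close together.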
3. Network Architecture, Training, and Evaluation
Architecture
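The MedCPT procedure coding network comprises four fully-connected (FC) hidden layers plus a sigmoid output layer for multi-label prediction:
- Hidden sizes: $n_1, n_2, n_3, n_4$ for the four FC layers.
- Embedding dimensions: $d_c$ per character position ($7 d_c$ per code), $d_p$ for the provider.
- Input vector: $x$, the concatenation of the claim embedding, the provider embedding, and the age and gender features.
- All FC layers employ ReLU; only the output uses sigmoid.
Forward computation: $h_0 = x$, $h_\ell = \mathrm{ReLU}(W_\ell h_{\ell-1} + b_\ell)$ for $\ell = 1, \dots, 4$, and $\hat{y} = \sigma(W_5 h_4 + b_5)$.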
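To make the forward computation concrete, here is a small dependency-free sketch of a network with four ReLU hidden layers and a sigmoid output. The layer widths in `sizes`, the Gaussian initialisation, and the helper names are hypothetical; they are not the values used by the authors.

```python
import math
import random

def relu(v):
    return [max(0.0, a) for a in v]

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

def linear(W, b, x):
    """Compute W x + b for a weight matrix stored as a list of rows."""
    return [sum(w * a for w, a in zip(row, x)) + bias for row, bias in zip(W, b)]

def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (illustrative initialisation)."""
    W = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out

def forward(layers, x):
    """ReLU on every hidden layer, sigmoid on the final (output) layer."""
    h = x
    for W, b in layers[:-1]:
        h = relu(linear(W, b, h))
    W_out, b_out = layers[-1]
    return sigmoid(linear(W_out, b_out, h))

rng = random.Random(0)
sizes = [10, 16, 16, 8, 8, 5]   # input, four hidden layers, 5 outputs (hypothetical)
layers = [init_layer(a, b, rng) for a, b in zip(sizes[:-1], sizes[1:])]
probs = forward(layers, [rng.random() for _ in range(sizes[0])])
```

Each entry of `probs` is an independent per-code probability, so several CPT codes can be predicted for the same claim. This is why the output layer uses sigmoid rather than softmax.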
Training
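- Loss: Multi-label sigmoid cross-entropy: $\mathcal{L} = -\sum_{j=1}^{C} \left[ y_j \log \hat{y}_j + (1 - y_j) \log (1 - \hat{y}_j) \right]$, where $C$ is the number of CPT codes.
- Optimization: Adam optimizer (moment parameters $\beta_1$, $\beta_2$), an initial learning rate with exponential decay, batch size 128, L2 weight decay, 10–20 epochs with early stopping.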
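The multi-label sigmoid cross-entropy can be written out directly. The sketch below sums the per-label binary cross-entropy for a single claim; whether the original system sums or averages over labels is not stated here, so the summation, the clipping constant `eps`, and the example values are assumptions.

```python
import math

def sigmoid_cross_entropy(y_true, y_prob, eps=1e-12):
    """Sum of per-label binary cross-entropies for one claim."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)   # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

# Predictions close to the labels give a lower loss than predictions far from them.
loss_good = sigmoid_cross_entropy([1, 0, 1], [0.9, 0.1, 0.8])
loss_bad = sigmoid_cross_entropy([1, 0, 1], [0.2, 0.7, 0.3])
```

Each label contributes its own term, so the loss penalises a missed code and a spurious code independently.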
Evaluation
- Metrics: Precision@K, Recall@K. Recall@3 reaches 0.90—i.e., 90% of true CPT codes for a claim appear in top 3 predictions.
- Dataset: 2.3 million real, paid claims (training: 1M, validation: 10%, test: 70,000 claims from unseen providers).
- Preprocessing: Exclusion of PQRS/standard vaccination codes, denied claims, and age/gender outliers; post-prediction gender-specific adjustments.
- Runtime: Deep model: 25 hours on one GPU; rule-based baselines: 20 min (probabilistic) and 48 hours (association-rule mining) (Haq et al., 2017).
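Precision@K and Recall@K, the metrics used in this evaluation, can be computed per claim as in the sketch below. The scores and labels are made-up toy values, and `precision_recall_at_k` is an illustrative helper rather than code from the original system.

```python
def top_k(scores, k):
    """Indices of the k highest-scoring labels."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

def precision_recall_at_k(scores, true_labels, k):
    """Precision@K and Recall@K for one claim."""
    predicted = top_k(scores, k)
    hits = sum(1 for i in predicted if i in true_labels)
    precision = hits / k
    recall = hits / len(true_labels) if true_labels else 0.0
    return precision, recall

scores = [0.05, 0.80, 0.10, 0.60, 0.30]   # predicted probability per CPT code
true_labels = {1, 4}                      # indices of the codes actually billed
p3, r3 = precision_recall_at_k(scores, true_labels, k=3)
```

Here the top three predictions are codes 1, 3, and 4, so both true codes are recovered (Recall@3 of 1.0) while one of the three predictions is wrong (Precision@3 of 2/3). A dataset-level figure is obtained by averaging over claims.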
4. MedCPT: Pretrained Contrastive Transformers for Biomedical IR
MedCPT for IR leverages large-scale contrastive learning on PubMed click logs for state-of-the-art, domain-sensitive zero-shot retrieval.
System Overview
Bi-encoder Retriever (MedCPT-Retriever)
- Query encoder (QEnc) and document encoder (DEnc) based on PubMedBERT (“BERT-base”: 12 layers, 768 hidden units, 12 heads, 3072 FFN).
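- Encodings: $E(q) = \mathrm{QEnc}(q)$ and $E(d) = \mathrm{DEnc}(d)$, each taken from the `[CLS]` token.
- Relevance: $\mathrm{rel}(q, d) = E(q)^\top E(d)$, with Maximum Inner Product Search (MIPS) at inference.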
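Inner-product scoring with exact MIPS can be illustrated as follows. The three-dimensional vectors are toy stand-ins for encoder outputs (real embeddings from a BERT-base encoder have 768 dimensions), and the brute-force scan is an assumption made for clarity; large collections normally use an approximate nearest-neighbour index instead.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mips(query_vec, doc_vecs, k):
    """Exact (brute-force) maximum inner product search over all documents."""
    scored = [(dot(query_vec, d), doc_id) for doc_id, d in doc_vecs.items()]
    scored.sort(reverse=True)
    return [(doc_id, score) for score, doc_id in scored[:k]]

# Toy document embeddings, which would be precomputed offline by the document encoder.
doc_vecs = {
    "d1": [0.9, 0.1, 0.0],
    "d2": [0.1, 0.8, 0.2],
    "d3": [0.4, 0.4, 0.4],
}
query_vec = [1.0, 0.2, 0.0]   # toy query embedding
top2 = mips(query_vec, doc_vecs, k=2)
```

Because queries and documents are encoded independently, document vectors can be indexed once and reused for every query. This is the main efficiency argument for a bi-encoder.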
Cross-encoder Re-ranker (MedCPT-CrossEnc)
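- Also PubMedBERT; input is the concatenated pair `[CLS] q [SEP] d [SEP]`.
- Relevance: $h = \mathrm{CrossEnc}(q, d)$ from the `[CLS]` token, and $\mathrm{rel}(q, d)$ is a scalar score computed from $h$.
- Applied to the top-$k$ candidates from the retriever.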
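The two-stage design, cheap retrieval followed by more expensive re-ranking of a short candidate list, can be sketched as below. The lookup tables `RETRIEVER` and `RERANKER` are hypothetical stand-ins for the bi-encoder and cross-encoder scores, and the cutoff `k=3` is an arbitrary illustrative value.

```python
def retrieve_then_rerank(query, doc_ids, retriever_score, reranker_score, k):
    """Score all documents cheaply, then re-score only the top k."""
    first_stage = sorted(doc_ids, key=lambda d: retriever_score(query, d), reverse=True)
    candidates = first_stage[:k]
    return sorted(candidates, key=lambda d: reranker_score(query, d), reverse=True)

# Toy scoring tables standing in for the bi-encoder and the cross-encoder.
RETRIEVER = {"d1": 0.9, "d2": 0.7, "d3": 0.6, "d4": 0.1}
RERANKER = {"d1": 0.2, "d2": 0.95, "d3": 0.5, "d4": 0.99}

ranked = retrieve_then_rerank(
    "example query",
    list(RETRIEVER),
    retriever_score=lambda q, d: RETRIEVER[d],
    reranker_score=lambda q, d: RERANKER[d],
    k=3,
)
```

In this toy run `d4` never reaches the re-ranker even though it would have scored highest there, which shows why the quality of the final ranking is bounded by the recall of the first stage.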
5. Pre-training Strategy and Data Utilization
Data Curation
- Source: 255 million PubMed click pairs from 2020–2022, derived from 167M queries and 23M articles; post-filtering yields 87M informational queries and 17M articles.
Contrastive Pre-training
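- Retriever Loss: In-batch negatives; weighted InfoNCE style: $\mathcal{L} = -\sum_{i} w_i \log \frac{\exp(\mathrm{rel}(q_i, d_i))}{\sum_{j} \exp(\mathrm{rel}(q_i, d_j))}$, where $j$ ranges over the documents in the batch and $w_i$ is the weight of pair $i$.
- Re-ranker: Hard negatives sampled from retriever ranks 50–200; softmax-NLL loss over one positive and $n$ sampled negatives.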
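Both training signals can be sketched in a few lines. The first function is a generic in-batch InfoNCE loss with optional per-pair weights; the exact weighting scheme used by the authors is not reproduced here, so the weighted mean is an assumption. The second function draws hard negatives from a rank window. Treating ranks as 1-indexed and sampling uniformly are also assumptions.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def in_batch_infonce(query_vecs, doc_vecs, weights=None):
    """Weighted mean of -log softmax(score of the paired document).

    query_vecs[i] is paired with doc_vecs[i]; every other document in the
    batch serves as a negative for query i.
    """
    n = len(query_vecs)
    weights = weights or [1.0] * n
    total = 0.0
    for i, q in enumerate(query_vecs):
        scores = [dot(q, d) for d in doc_vecs]
        m = max(scores)   # subtract the max to stabilise the log-sum-exp
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += weights[i] * (log_z - scores[i])
    return total / sum(weights)

def sample_hard_negatives(ranked_doc_ids, positive_id, n, rng, lo=50, hi=200):
    """Sample n negatives from retriever ranks lo..hi (1-indexed, inclusive)."""
    pool = [d for d in ranked_doc_ids[lo - 1:hi] if d != positive_id]
    return rng.sample(pool, min(n, len(pool)))

queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_docs = [[2.0, 0.0], [0.0, 2.0]]   # each query matches its own document
swapped_docs = [[0.0, 2.0], [2.0, 0.0]]   # each query matches the other document
loss_aligned = in_batch_infonce(queries, aligned_docs)
loss_swapped = in_batch_infonce(queries, swapped_docs)

ranked = [f"d{r}" for r in range(1, 301)]   # toy retriever ranking, best first
negatives = sample_hard_negatives(ranked, "d60", n=5, rng=random.Random(0))
```

Drawing negatives from mid-range ranks rather than from the very top is a common way to get difficult examples while lowering the chance of picking unlabeled relevant documents.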
6. Experimental Results and Benchmarks
Evaluation Protocol
- Zero-shot document retrieval: BEIR benchmark (TREC-COVID, NFCorpus, BioASQ, SciFact, SciDocs), RELISH (article similarity), BIOSSES/MedSTS (sentence similarity).
- Metrics: nDCG@10 (document retrieval), MAP@5 and nDCG@5 (article similarity), Pearson's $r$ (sentence similarity).
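For readers unfamiliar with nDCG, a minimal implementation is sketched here. It assumes linear gain with a log2 rank discount and, as a simplification, derives the ideal ranking from the retrieved list only; a full evaluation would build the ideal ranking from all judged documents for the query.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain with linear gain and log2 rank discount."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k):
    """DCG of the given ranking divided by the DCG of the ideal ordering."""
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Graded relevance labels of the returned documents, in ranked order.
perfect = ndcg_at_k([2, 1, 0, 0], k=10)     # best documents first
imperfect = ndcg_at_k([0, 1, 2, 0], k=10)   # best document only at rank 3
```

The score is 1.0 when documents are ordered by decreasing relevance and drops as relevant documents move down the list.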
Performance Summary
| Task (nDCG@10) | BM25 | cpt-XL | MedCPT |
|---|---|---|---|
| TREC-COVID | 0.656 | 0.649 | 0.709 |
| NFCorpus | 0.325 | 0.407 | 0.355 |
| BioASQ | 0.465 | — | 0.553 |
| SciFact | 0.665 | 0.754 | 0.761 |
| SciDocs | 0.158 | — | 0.172 |
| Average | 0.454 | — | 0.510 |
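The Average row can be checked directly from the per-task scores in the table. The short script below recomputes the BM25 and MedCPT means; cpt-XL is omitted because two of its entries are missing.

```python
TASKS = ["TREC-COVID", "NFCorpus", "BioASQ", "SciFact", "SciDocs"]
BM25 = [0.656, 0.325, 0.465, 0.665, 0.158]
MEDCPT = [0.709, 0.355, 0.553, 0.761, 0.172]

def mean(values):
    return sum(values) / len(values)

bm25_avg = round(mean(BM25), 3)       # 0.454, matching the table
medcpt_avg = round(mean(MEDCPT), 3)   # 0.51, matching the table's 0.510

# Tasks on which MedCPT scores higher than BM25.
wins = [task for task, b, m in zip(TASKS, BM25, MEDCPT) if m > b]
```

MedCPT is ahead of BM25 on all five tasks, while cpt-XL remains higher on NFCorpus (0.407 versus 0.355).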
- Article similarity (RELISH): reported as MAP@5 and nDCG@5, with SciNCL as the comparison baseline.
- Sentence similarity: on BIOSSES, MedCPT scores 0.05 over SciNCL; on MedSTS, it ranks second to BioSentVec.
- All tasks were conducted in a zero-shot setting, with no supervised query-document annotations beyond click logs (Jin et al., 2023).
7. Applications, Limitations, and Extensions
Applications
- Functionality includes PubMed Best Match ranking, similar-article recommendation, sentence-level search, and retrieval augmentation for LLMs.
Limitations
- Dense semantic retrieval systems such as MedCPT provide less interpretability than sparse methods (e.g., BM25) and may yield false semantic matches, particularly with ambiguous gene symbols.
- Neither MedCPT framework incorporates structured ontologies natively, which limits explainability.
Future Directions
- Hybrid dense–sparse systems are proposed for improved transparency and balanced recall–precision tradeoffs.
- Better negative sampling and scaling are ongoing directions.
- Integration of structured biomedical ontologies may enhance interpretability (Jin et al., 2023).
A plausible implication is that both MedCPT architectures, whether for procedure prediction within EHRs or for retrieval from the biomedical literature, offer reusable approaches to learning dense semantic representations at scale from real-world clinical or user-interaction data. They can serve as blueprints for embedding-based automation in clinical decision support and large-scale biomedical information access.