MedCPT: Biomedical Coding & Retrieval

Updated 10 May 2026

MedCPT names two separate frameworks: an end-to-end deep learning system for procedure coding and a contrastively pre-trained Transformer for zero-shot biomedical information retrieval.
The coding system uses character-level ICD embeddings and a multi-label sigmoid cross-entropy loss, reaching a Recall@3 of 0.90 on EHR claims.
The retrieval system is trained on large-scale PubMed click logs and pairs a bi-encoder retriever with a cross-encoder re-ranker for semantic search.

MedCPT refers to two distinct, foundational frameworks in biomedical informatics: (1) an end-to-end deep learning system for automatic coding of procedure codes (Current Procedural Terminology; CPT) from diagnosis codes (ICD-10) in Electronic Health Records (EHRs) (Haq et al., 2017), and (2) a large-scale contrastively pre-trained Transformer model for zero-shot biomedical information retrieval (IR) using PubMed search logs (Jin et al., 2023). Both systems advance the state of the art in their respective domains by leveraging supervised and self-supervised learning at scale, embedding-based architectures, and tailored loss functions for multi-label classification and semantic retrieval.

1. Automatic Procedure Coding in EHRs

The MedCPT-style system for EHRs addresses the challenge of mapping high-cardinality diagnosis code sets and contextual covariates to appropriate CPT procedure codes—a multi-label classification task crucial for billing and clinical workflow efficiency.

Problem Formulation

Input: Each insurance claim comprises a variable set of ICD-10 diagnosis codes, patient age, gender, and provider ID. ICDs are encoded by a high-dimensional sparse one-hot vector $x^0 \in \{0,1\}^D$ with $D \approx 70{,}000$. After embedding, this yields $x\in\mathbb{R}^d$.
Output: A set of CPT codes, represented as a binary vector $y\in\{0,1\}^C$ ($C \approx 13{,}000$), or probability vector $p=f(x)\in[0,1]^C$ with $f : \mathbb{R}^d \rightarrow [0,1]^C$.

2. Embedded Representations and Aggregation

Character-level ICD Embeddings

Each ICD-10 code is a sequence of 7 alphanumeric characters (alphabet size $V=36$).
For each position $j=1,\dots,7$, learn embedding matrices $E^{(j)}\in\mathbb{R}^{V\times d^j}$. If code $c$ has character $c_j$ at position $j$, its embedding is the corresponding row $E^{(j)}_{c_j}\in\mathbb{R}^{d^j}$.
The code embedding $e(c)$ is the concatenation $[E^{(1)}_{c_1}; \dots; E^{(7)}_{c_7}]$ with total dimension $\sum_{j=1}^{7} d^j$.
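The per-position lookup and concatenation can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the per-position dimension `CHAR_DIM`, the padding character `PAD_CHAR`, the random initialisation, and the helper names are all assumptions made for the example.

```python
import random

ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"  # V = 36 symbols
NUM_POSITIONS = 7   # one embedding table per character position
CHAR_DIM = 4        # hypothetical per-position dimension (not from the source)
PAD_CHAR = "0"      # hypothetical padding for codes shorter than 7 characters


def init_tables(seed=0):
    """Create one V x CHAR_DIM table per position, randomly initialised."""
    rng = random.Random(seed)
    return [
        [[rng.uniform(-0.1, 0.1) for _ in range(CHAR_DIM)] for _ in ALPHABET]
        for _ in range(NUM_POSITIONS)
    ]


def normalize(code):
    """Drop the dot, upper-case, and right-pad the code to 7 characters."""
    cleaned = code.replace(".", "").upper()
    if not 1 <= len(cleaned) <= NUM_POSITIONS:
        raise ValueError(f"unexpected ICD-10 code length: {code!r}")
    return cleaned.ljust(NUM_POSITIONS, PAD_CHAR)


def embed_code(code, tables):
    """Concatenate the row selected at each position into one code vector."""
    vector = []
    for position, char in enumerate(normalize(code)):
        vector.extend(tables[position][ALPHABET.index(char)])
    return vector


tables = init_tables()
vec_a = embed_code("E11.9", tables)
vec_b = embed_code("E11.8", tables)
```

Because each position has its own table, two codes that share a prefix share the corresponding leading slice of their embeddings, which is what lets related diagnoses start out close together.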

Claim-level Aggregation

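For the ICD codes on a claim, the per-code embeddings are aggregated into a single fixed-length claim embedding.
Provider ID is embedded as a learned dense vector.
Full input: the claim embedding concatenated with the provider embedding and the age and gender covariates, giving $x\in\mathbb{R}^d$.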
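As a rough sketch of how a variable-size set of code embeddings can become one fixed-length input, the Python below pools the code vectors and concatenates the result with a provider vector and two covariates. Mean pooling, the age scaling, and the gender encoding are assumptions chosen only for illustration, and `mean_pool` and `build_input` are hypothetical helper names.

```python
def mean_pool(vectors):
    """Average a non-empty list of equal-length vectors element-wise."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def build_input(code_vectors, provider_vector, age, gender):
    """Concatenate pooled ICD embedding, provider embedding, and covariates."""
    pooled = mean_pool(code_vectors)
    # Hypothetical covariate encoding: age scaled by 100, gender as a 0/1 flag.
    covariates = [age / 100.0, 1.0 if gender == "F" else 0.0]
    return pooled + list(provider_vector) + covariates


x = build_input([[1.0, 3.0], [3.0, 5.0]], [0.5], age=50, gender="F")
```

Whatever pooling is used, the key property is that the output length does not depend on how many diagnosis codes the claim carries.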

3. Network Architecture, Training, and Evaluation

Architecture

The MedCPT procedure coding network comprises four fully-connected (FC) hidden layers plus a sigmoid output layer for multi-label prediction:

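Hidden sizes: one width per FC hidden layer, as reported in Haq et al. (2017).
Embedding dimensions: the same size for each of the seven character positions, plus a separate size for the provider embedding.
Input vector: size $d$, the combined width of the claim embedding, the provider embedding, and the covariates.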
All FC layers employ ReLU; only the output uses sigmoid.

Forward computation: $h_0 = x$; $h_l = \mathrm{ReLU}(W_l h_{l-1} + b_l)$ for $l = 1,\dots,4$; $p = \sigma(W_5 h_4 + b_5)$.
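The forward pass of such a network (four ReLU hidden layers followed by a sigmoid output) can be sketched in plain Python. The layer widths below are toy values chosen for the example, not the sizes used by the authors, and the helper names are hypothetical.

```python
import math
import random


def relu(values):
    return [max(0.0, v) for v in values]


def sigmoid(values):
    return [1.0 / (1.0 + math.exp(-v)) for v in values]


def linear(weights, bias, inputs):
    """Compute W @ inputs + bias for a list-of-rows weight matrix."""
    return [sum(w * a for w, a in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]


def init_network(sizes, seed=0):
    """Build (weights, bias) pairs for consecutive layer sizes."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def forward(x, layers):
    """ReLU on every hidden layer, sigmoid on the output layer only."""
    hidden = x
    for weights, bias in layers[:-1]:
        hidden = relu(linear(weights, bias, hidden))
    weights, bias = layers[-1]
    return sigmoid(linear(weights, bias, hidden))


# Toy sizes: input 30, four hidden layers of 16, and 10 output labels.
network = init_network([30, 16, 16, 16, 16, 10])
probs = forward([0.1] * 30, network)
```

Each output is an independent probability, so several CPT codes can be predicted for the same claim.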

Training

Loss: Multi-label sigmoid cross-entropy: $\mathcal{L}(y, p) = -\sum_{c=1}^{C} \left[ y_c \log p_c + (1 - y_c) \log (1 - p_c) \right]$
Optimization: Adam optimizer (momentum parameters $\beta_1$, $\beta_2$), exponentially decayed learning rate, batch size 128, L2 weight decay, 10–20 epochs with early stopping.
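For reference, multi-label sigmoid cross-entropy treats every label as its own binary problem and sums the per-label terms. The Python sketch below assumes a sum over labels (rather than a mean) and clips probabilities with a small `eps` for numerical safety; both choices are assumptions of this example.

```python
import math


def multilabel_bce(y_true, p_pred, eps=1e-12):
    """Sum of binary cross-entropy terms over all labels of one example."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


uninformed = multilabel_bce([1, 0], [0.5, 0.5])
confident = multilabel_bce([1, 0], [0.9, 0.1])
```

A prediction of 0.5 for every label costs $\log 2$ per label, and the loss shrinks as probabilities move toward the true labels.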

Evaluation

Metrics: Precision@K, Recall@K. Recall@3 reaches 0.90—i.e., 90% of true CPT codes for a claim appear in top 3 predictions.
Dataset: 2.3 million real, paid claims (training: about 2M, validation: 10%, test: 70,000 claims from unseen providers).
Preprocessing: Exclusion of PQRS/standard vaccination codes, denied claims, and age/gender outliers; post-prediction gender-specific adjustments.
Runtime: the deep model trains in hours on one GPU; the rule-based baselines take 20 min (probabilistic) and 48 hours (association-rule mining) (Haq et al., 2017).
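Precision@K and Recall@K for a single claim can be computed as in the following Python sketch. The probabilities and label indices are made-up toy values, and `top_k` and `precision_recall_at_k` are hypothetical helper names.

```python
def top_k(probs, k):
    """Indices of the k highest-probability labels, best first."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]


def precision_recall_at_k(probs, true_labels, k):
    """Return (precision@k, recall@k) for one example."""
    predicted = set(top_k(probs, k))
    hits = len(predicted & set(true_labels))
    precision = hits / k
    recall = hits / len(true_labels) if true_labels else 0.0
    return precision, recall


# Toy claim with 5 candidate CPT labels; labels 1 and 4 are the true ones.
precision, recall = precision_recall_at_k([0.1, 0.8, 0.05, 0.6, 0.3], {1, 4}, k=3)
```

Here both true labels fall inside the top 3, so recall is 1.0 while precision is 2/3; corpus-level figures average these per-claim values.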

4. MedCPT: Pretrained Contrastive Transformers for Biomedical IR

MedCPT for IR leverages large-scale contrastive learning on PubMed click logs for state-of-the-art, domain-sensitive zero-shot retrieval.

System Overview

Bi-encoder Retriever (MedCPT-Retriever)

Query encoder (QEnc) and document encoder (DEnc) based on PubMedBERT (“BERT-base”: 12 layers, 768 hidden units, 12 heads, 3072 FFN).
Encodings: $E(q) = \mathrm{QEnc}(q)$ and $E(d) = \mathrm{DEnc}(d)$, taken from the $[\mathrm{CLS}]$ token.
Relevance: $\mathrm{Rel}(q, d) = E(q)^\top E(d)$, with Maximum Inner Product Search (MIPS) at inference.
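Once queries and documents are encoded as vectors, retrieval reduces to finding the documents with the largest inner product. The Python sketch below does this by brute force over toy 2-dimensional vectors; it only illustrates the scoring rule and says nothing about the index a production system would use. `inner_product` and `mips` are hypothetical helper names.

```python
def inner_product(u, v):
    return sum(a * b for a, b in zip(u, v))


def mips(query_vec, doc_vecs, k):
    """Exhaustive maximum inner product search: top-k (doc_index, score)."""
    scored = [(inner_product(query_vec, d), i) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(i, score) for score, i in scored[:k]]


# Toy vectors standing in for encoder outputs.
query = [1.0, 0.0]
documents = [[0.2, 0.9], [0.9, 0.1], [0.5, 0.5]]
top = mips(query, documents, k=2)
```

Because document vectors do not depend on the query, they can be computed once offline, which is the main efficiency argument for a bi-encoder.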

Cross-encoder Re-ranker (MedCPT-CrossEnc)

Also PubMedBERT; input $[\mathrm{CLS}]\; q \;[\mathrm{SEP}]\; d \;[\mathrm{SEP}]$.
Relevance: a scalar score predicted from the joint encoding of the query-document pair.
Applied to the top-$K$ candidates from the retriever.
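The two-stage flow (cheap retrieval over everything, expensive re-ranking over a short list) can be sketched as follows. The two scoring functions here are deliberately trivial token-overlap stand-ins, not the neural retriever or cross-encoder, and all names are hypothetical.

```python
def tokens(text):
    return text.lower().split()


def overlap_score(query, doc):
    """Stand-in for the retriever: number of shared tokens."""
    return len(set(tokens(query)) & set(tokens(doc)))


def density_score(query, doc):
    """Stand-in for the re-ranker: shared tokens per document token."""
    return overlap_score(query, doc) / max(len(tokens(doc)), 1)


def retrieve_then_rerank(query, docs, retriever_score, cross_score, k):
    # Stage 1: keep only the k best candidates under the cheap scorer.
    candidates = sorted(docs, key=lambda d: retriever_score(query, d), reverse=True)[:k]
    # Stage 2: reorder just those candidates with the expensive scorer.
    return sorted(candidates, key=lambda d: cross_score(query, d), reverse=True)


corpus = [
    "aspirin reduces stroke risk in adults",
    "aspirin stroke",
    "gene expression in yeast",
    "stroke rehabilitation",
]
ranked = retrieve_then_rerank("aspirin stroke", corpus, overlap_score, density_score, k=3)
```

The second-stage scorer only ever sees `k` documents per query, which keeps its cost bounded regardless of corpus size.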

5. Pre-training Strategy and Data Utilization

Data Curation

Source: 255 million PubMed click pairs from 2020–2022, derived from 167M queries and 23M articles; post-filtering yields 87M informational queries and 17M articles.

Contrastive Pre-training

Retriever loss: a weighted InfoNCE-style contrastive loss with in-batch negatives.
Re-ranker: hard negatives sampled from retriever ranks 50–200; softmax negative log-likelihood loss over one positive and a fixed number of sampled negatives.
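To make the in-batch negative idea concrete, the Python sketch below implements a plain, unweighted InfoNCE loss: for each query, the clicked document in the same row is the positive and every other document in the batch is a negative. It omits the weighting used in the paper, and the `temperature` parameter and function name are assumptions of this example.

```python
import math


def info_nce(query_vecs, doc_vecs, temperature=1.0):
    """Mean of -log softmax(q_i . d_i) taken over all documents in the batch."""
    total = 0.0
    for i, q in enumerate(query_vecs):
        logits = [sum(a * b for a, b in zip(q, d)) / temperature for d in doc_vecs]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_z - logits[i]
    return total / len(query_vecs)


queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(queries, [[1.0, 0.0], [0.0, 1.0]])
misaligned = info_nce(queries, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is lower when each query scores its own document above the others in the batch, which is the signal that pulls clicked pairs together.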

6. Experimental Results and Benchmarks

Evaluation Protocol

Zero-shot document retrieval: BEIR benchmark (TREC-COVID, NFCorpus, BioASQ, SciFact, SciDocs), RELISH (article similarity), BIOSSES/MedSTS (sentence similarity).
Metrics: nDCG@10 (document retrieval), MAP@5 and nDCG@5 (article similarity), Pearson's $r$ (sentence similarity).
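As a reminder of what nDCG@k measures, the Python sketch below computes it for one ranked list of graded relevance labels. It uses the linear-gain form and builds the ideal ranking from the labels in the list itself, which assumes every relevant document appears in the list; benchmark toolkits may differ on both points.

```python
import math


def dcg(relevances, k):
    """Discounted cumulative gain over the first k positions."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k):
    """DCG of the given order divided by DCG of the best possible order."""
    ideal = dcg(sorted(ranked_relevances, reverse=True), k)
    return dcg(ranked_relevances, k) / ideal if ideal > 0 else 0.0


perfect = ndcg_at_k([3, 2, 1], k=10)
reversed_order = ndcg_at_k([1, 2, 3], k=10)
```

A perfectly ordered list scores 1.0, and placing highly relevant documents lower in the ranking reduces the score.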

Performance Summary

Task (nDCG@10)	BM25	cpt-XL	MedCPT
TREC-COVID	0.656	0.649	0.709
NFCorpus	0.325	0.407	0.355
BioASQ	0.465	—	0.553
SciFact	0.665	0.754	0.761
SciDocs	0.158	—	0.172
Average	0.454	—	0.510
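The averages in the last row can be reproduced from the five per-task scores with a short Python check. Only the BM25 and MedCPT columns are included because cpt-XL has missing entries.

```python
scores = {
    "BM25": [0.656, 0.325, 0.465, 0.665, 0.158],
    "MedCPT": [0.709, 0.355, 0.553, 0.761, 0.172],
}

# Unweighted mean over the five tasks, rounded to three decimals.
averages = {name: round(sum(vals) / len(vals), 3) for name, vals in scores.items()}

# Absolute gain of MedCPT over BM25 on the average.
gain = round(averages["MedCPT"] - averages["BM25"], 3)
```

The computed means match the table (0.454 and 0.510), giving an average absolute gain of 0.056 over BM25.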

Article similarity (RELISH): reported as MAP@5 and nDCG@5, with SciNCL as the comparison model.
Sentence similarity: on BIOSSES, MedCPT scores 0.05 above SciNCL; on MedSTS it ranks second to BioSentVec.
All tasks were conducted in a zero-shot setting, with no supervised query-document annotations beyond click logs (Jin et al., 2023).

7. Applications, Limitations, and Extensions

Applications

Applications include PubMed Best Match ranking, similar-article recommendation, sentence-level search, and retrieval augmentation for LLMs.

Limitations

Dense semantic retrieval systems such as MedCPT provide less interpretability than sparse methods (e.g., BM25) and may yield false semantic matches, particularly with ambiguous gene symbols.
Neither MedCPT framework incorporates structured ontologies natively, which limits explainability.

Future Directions

Hybrid dense–sparse systems are proposed for improved transparency and balanced recall–precision tradeoffs.
Better negative sampling and scaling are ongoing directions.
Integration of structured biomedical ontologies may enhance interpretability (Jin et al., 2023).

A plausible implication is that MedCPT architectures—whether for procedural prediction within EHRs or for retrieval from the biomedical literature—provide generalizable, highly performant foundations for learning dense semantic representations at scale using real-world clinical or user interaction data. These methods serve as reproducible blueprints for embedding-based automation in both clinical decision support and large-scale biomedical information access.

References (2)

Intelligent EHRs: Predicting Procedure Codes From Diagnosis Codes (2017)

MedCPT: Contrastive Pre-trained Transformers with Large-scale PubMed Search Logs for Zero-shot Biomedical Information Retrieval (2023)
