MedCPT: Biomedical Coding & Retrieval

Updated 10 May 2026
  • MedCPT is a dual-framework system employing end-to-end deep learning for procedure coding and contrastively pre-trained Transformers for zero-shot biomedical information retrieval.
  • It leverages character-level ICD embeddings and tailored loss functions to achieve high recall rates in multi-label classification of EHR claims.
  • Its retrieval component uses large-scale PubMed click logs with bi-encoder and cross-encoder architectures to enhance semantic search and benchmark performance.

MedCPT refers to two distinct, foundational frameworks in biomedical informatics: (1) an end-to-end deep learning system for automatic coding of procedure codes (Current Procedural Terminology; CPT) from diagnosis codes (ICD-10) in Electronic Health Records (EHRs) (Haq et al., 2017), and (2) a large-scale contrastively pre-trained Transformer model for zero-shot biomedical information retrieval (IR) using PubMed search logs (Jin et al., 2023). Both systems advance the state of the art in their respective domains by leveraging supervised and self-supervised learning at scale, embedding-based architectures, and tailored loss functions for multi-label classification and semantic retrieval.

1. Automatic Procedure Coding in EHRs

The MedCPT-style system for EHRs addresses the challenge of mapping high-cardinality diagnosis code sets and contextual covariates to appropriate CPT procedure codes—a multi-label classification task crucial for billing and clinical workflow efficiency.

Problem Formulation

  • Input: Each insurance claim comprises a variable-size set of ICD-10 diagnosis codes, patient age, gender, and provider ID. The ICD codes are encoded as a high-dimensional sparse multi-hot vector $x^0 \in \{0,1\}^D$ with $D \approx 70{,}000$. After embedding, this yields $x\in\mathbb{R}^d$.
  • Output: A set of CPT codes, represented as a binary vector $y\in\{0,1\}^C$ ($C \approx 13{,}000$), or a probability vector $p=f(x)\in[0,1]^C$ with $f : \mathbb{R}^d \rightarrow [0,1]^C$.
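The sparse input and output encodings can be illustrated with a minimal sketch. The vocabularies and the helper name `multi_hot` below are hypothetical toy stand-ins; the real vocabularies have roughly 70,000 ICD-10 and 13,000 CPT entries.

```python
def multi_hot(codes, vocabulary):
    """Return a 0/1 list with a 1 at the index of every code that is present."""
    index = {code: i for i, code in enumerate(vocabulary)}
    vec = [0] * len(vocabulary)
    for code in codes:
        vec[index[code]] = 1
    return vec

# Hypothetical toy vocabularies (assumption: chosen only for illustration).
icd_vocab = ["E11.9", "I10", "J45.909", "M54.5"]
cpt_vocab = ["99213", "93000", "94010"]

x0 = multi_hot(["I10", "E11.9"], icd_vocab)   # sparse claim input
y = multi_hot(["99213", "93000"], cpt_vocab)  # multi-label target
```

Because a claim carries several diagnosis codes at once, more than one entry of the input vector can be set, and the same holds for the target vector.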

2. Embedded Representations and Aggregation

Character-level ICD Embeddings

  • Each ICD-10 code is a sequence of 7 alphanumeric characters (alphabet size $V=36$).
  • For each position $j=1,\dots,7$, learn an embedding matrix $E^{(j)}\in\mathbb{R}^{V\times d^j}$. If code $c$ has character $c_j$ at position $j$, its embedding is $E^{(j)}[c_j]$, the row of $E^{(j)}$ indexed by $c_j$.
  • The code embedding $e(c)$ is the concatenation $[E^{(1)}[c_1]; \dots; E^{(7)}[c_7]]$ with total dimension $\sum_{j=1}^{7} d^j$.
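A minimal sketch of position-wise character embedding follows. Several details are assumptions made only for illustration: the per-position dimension `CHAR_DIM`, random initialisation in place of learned matrices, dropping the dot, and right-padding short codes with `"0"`.

```python
import random

ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"  # V = 36 symbols
NUM_POSITIONS = 7
CHAR_DIM = 4  # hypothetical per-position embedding dimension

rng = random.Random(0)
# One V x CHAR_DIM table per character position; random values stand in
# for parameters that would be learned during training.
tables = [
    [[rng.gauss(0.0, 0.1) for _ in range(CHAR_DIM)] for _ in ALPHABET]
    for _ in range(NUM_POSITIONS)
]

def embed_code(code, pad_char="0"):
    """Concatenate the position-specific embeddings of a code's characters."""
    # Assumption: the dot is dropped and short codes are right-padded.
    chars = code.replace(".", "").upper().ljust(NUM_POSITIONS, pad_char)
    vec = []
    for j, ch in enumerate(chars[:NUM_POSITIONS]):
        vec.extend(tables[j][ALPHABET.index(ch)])
    return vec

e = embed_code("E11.9")  # vector of length NUM_POSITIONS * CHAR_DIM
```

Sharing character tables across codes means that codes with a common prefix, which usually belong to the same diagnostic family, start from similar representations even when one of them is rare in the training data.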

Claim-level Aggregation

  • For claim ICDs $\{c_1,\dots,c_n\}$, the per-code embeddings are pooled into a single fixed-length claim embedding.
  • Provider ID is mapped to a learned embedding vector.
  • Full input: $x$, the concatenation of the claim embedding, the provider embedding, and the age and gender covariates.

3. Network Architecture, Training, and Evaluation

Architecture

The MedCPT procedure coding network comprises four fully-connected (FC) hidden layers plus a sigmoid output layer for multi-label prediction:

  • Hidden sizes: […].
  • Embedding dimensions: character […] each ([…]), provider […].
  • Input vector: size […].
  • All FC layers employ ReLU; only the output uses sigmoid.

Forward computation: with $h_0 = x$, $h_l = \mathrm{ReLU}(W_l h_{l-1} + b_l)$ for $l=1,\dots,4$, and $p = \sigma(W_5 h_4 + b_5)$.
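The forward pass through four ReLU hidden layers and a sigmoid output layer can be sketched in plain Python. The layer sizes in `sizes` and the initialisation scheme are hypothetical placeholders, not the values used by Haq et al.

```python
import math
import random

def relu(v):
    return [max(0.0, a) for a in v]

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

def linear(W, b, h):
    """Affine map W h + b, with W stored as a list of rows."""
    return [sum(w * a for w, a in zip(row, h)) + bias for row, bias in zip(W, b)]

def init_layer(n_in, n_out, rng):
    scale = math.sqrt(2.0 / n_in)  # assumption: He-style initialisation
    W = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out

def forward(x, layers):
    """ReLU on every hidden layer, sigmoid on the final layer only."""
    h = x
    for W, b in layers[:-1]:
        h = relu(linear(W, b, h))
    W, b = layers[-1]
    return sigmoid(linear(W, b, h))

rng = random.Random(0)
sizes = [10, 16, 16, 8, 8, 5]  # hypothetical: input, four hidden layers, C outputs
layers = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
x = [rng.random() for _ in range(sizes[0])]
p = forward(x, layers)  # one independent probability per CPT code
```

The sigmoid output treats each CPT code as its own binary decision, so the probabilities do not need to sum to one, which is what a multi-label task requires.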

Training

  • Loss: Multi-label sigmoid cross-entropy, which for one claim with labels $y$ and predictions $p$ is $\mathcal{L}(y,p) = -\sum_{c=1}^{C}\left[y_c\log p_c + (1-y_c)\log(1-p_c)\right]$.
  • Optimization: Adam optimizer (hyperparameters […], […]), initial learning rate […], exponential decay, batch size 128, L2 weight decay […], 10–20 epochs with early stopping.
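As a worked illustration of the multi-label sigmoid cross-entropy, the sketch below computes the loss for a single claim. The clipping constant `eps` is an assumption added for numerical safety and is not taken from the source.

```python
import math

def sigmoid_cross_entropy(y, p, eps=1e-12):
    """Sum of per-label binary cross-entropies for one claim."""
    total = 0.0
    for yc, pc in zip(y, p):
        pc = min(max(pc, eps), 1.0 - eps)  # keep log() finite
        total -= yc * math.log(pc) + (1 - yc) * math.log(1.0 - pc)
    return total

y = [1, 0, 1]
good = sigmoid_cross_entropy(y, [0.9, 0.1, 0.8])  # confident and correct
bad = sigmoid_cross_entropy(y, [0.2, 0.7, 0.3])   # mostly wrong
```

Each label contributes its own term, so a claim with several correct CPT codes is penalised for every code that is missed as well as for every code that is wrongly predicted.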

Evaluation

  • Metrics: Precision@K, Recall@K. Recall@3 reaches 0.90—i.e., 90% of true CPT codes for a claim appear in top 3 predictions.
  • Dataset: 2.3 million real, paid claims (training: […]M, validation: 10%, test: 70,000 claims from unseen providers).
  • Preprocessing: Exclusion of PQRS/standard vaccination codes, denied claims, and age/gender outliers; post-prediction gender-specific adjustments.
  • Runtime: Deep model: on the order of 5 hours on one GPU; rule-based baselines: 20 min (probabilistic) and 48 hours (association-rule mining) (Haq et al., 2017).
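Precision@K and Recall@K for a single claim can be computed as in the following sketch. The scores and label indices are made-up toy values, and the function names are illustrative.

```python
def top_k(scores, k):
    """Indices of the k highest-scoring labels, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

def precision_recall_at_k(scores, true_labels, k):
    predicted = set(top_k(scores, k))
    hits = len(predicted & set(true_labels))
    precision = hits / k
    recall = hits / len(true_labels) if true_labels else 0.0
    return precision, recall

scores = [0.10, 0.80, 0.05, 0.60, 0.30]  # toy predicted probabilities
true_labels = {1, 4}                     # toy indices of the true CPT codes
p3, r3 = precision_recall_at_k(scores, true_labels, k=3)
```

Here the top three predictions are labels 1, 3 and 4, so both true codes are recovered (recall 1.0) while one of the three suggestions is wrong (precision 2/3). Averaging the per-claim recall over a test set gives a figure comparable to the reported Recall@3.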

4. MedCPT: Pretrained Contrastive Transformers for Biomedical IR

MedCPT for IR leverages large-scale contrastive learning on PubMed click logs for state-of-the-art, domain-sensitive zero-shot retrieval.

System Overview

Bi-encoder Retriever (MedCPT-Retriever)

  • Query encoder (QEnc) and document encoder (DEnc) based on PubMedBERT (“BERT-base”: 12 layers, 768 hidden units, 12 heads, 3072 FFN).
  • Encodings: $E(q)=\mathrm{QEnc}(q)$ and $E(d)=\mathrm{DEnc}(d)$, each taken from the [CLS] token.
  • Relevance: $\mathrm{Rel}(q,d)=E(q)^\top E(d)$, with Maximum Inner Product Search (MIPS) at inference.
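Inner-product retrieval over precomputed document vectors can be sketched with an exhaustive search. The three-dimensional vectors are toy stand-ins for encoder outputs, and a production system would typically replace the linear scan with an approximate nearest-neighbour index.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mips(query_vec, doc_vecs, k):
    """Brute-force maximum inner product search: top-k (doc_id, score) pairs."""
    scored = [(dot(query_vec, d), i) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [(i, s) for s, i in scored[:k]]

# Toy vectors standing in for document and query encodings (assumption).
doc_vecs = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.0, 1.0, 1.0]]
query_vec = [0.0, 1.0, 0.5]
hits = mips(query_vec, doc_vecs, k=2)
```

Because the query and document are encoded separately, document vectors can be computed once offline, and only the query needs to be encoded at search time.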

Cross-encoder Re-ranker (MedCPT-CrossEnc)

  • Also PubMedBERT; input is the concatenated pair [CLS] $q$ [SEP] $d$ [SEP].
  • Relevance: a scalar score predicted from the joint encoding of the query-document pair.
  • Applied to the top-$k$ candidates from the retriever.
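The two-stage retrieve-then-re-rank flow can be sketched as follows. The function `toy_cross_score` is a hypothetical token-overlap stand-in for a cross-encoder and is not the MedCPT model; the documents and vectors are invented for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def retrieve(query_vec, doc_vecs, k):
    """Stage 1: cheap bi-encoder scoring over the whole collection."""
    order = sorted(range(len(doc_vecs)),
                   key=lambda i: dot(query_vec, doc_vecs[i]), reverse=True)
    return order[:k]

def rerank(query, candidate_ids, docs, cross_score):
    """Stage 2: expensive joint scoring, applied only to the candidates."""
    return sorted(candidate_ids,
                  key=lambda i: cross_score(query, docs[i]), reverse=True)

def toy_cross_score(query, doc):
    # Hypothetical stand-in: fraction of query tokens found in the document.
    q_tokens = query.lower().split()
    d_tokens = set(doc.lower().split())
    return sum(t in d_tokens for t in q_tokens) / len(q_tokens)

docs = ["aspirin for headache relief", "aspirin dosing in children",
        "statins and cholesterol", "migraine headache triggers"]
doc_vecs = [[0.9, 0.1], [0.8, 0.3], [0.1, 0.9], [0.6, 0.2]]  # toy encodings
query, query_vec = "aspirin dosing", [1.0, 0.0]

candidates = retrieve(query_vec, doc_vecs, k=3)
ranked = rerank(query, candidates, docs, toy_cross_score)
```

The design trades accuracy for cost: the retriever narrows millions of articles to a short list, and the re-ranker, which must read each query-document pair jointly, only runs on that short list.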

5. Pre-training Strategy and Data Utilization

Data Curation

  • Source: 255 million PubMed click pairs from 2020–2022, derived from 167M queries and 23M articles; post-filtering yields 87M informational queries and 17M articles.

Contrastive Pre-training

  • Retriever loss: weighted InfoNCE-style contrastive loss with in-batch negatives.
  • Re-ranker: Hard negatives sampled from retriever ranks 50–200; softmax negative log-likelihood loss over one positive and the sampled negatives.
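A minimal sketch of an InfoNCE-style loss with in-batch negatives is given below. It is the plain, unweighted form; the per-pair weighting mentioned above is omitted because its exact form is not given here, so this should not be read as the authors' loss.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def in_batch_infonce(query_vecs, doc_vecs):
    """Mean negative log-softmax of each query's own document.

    Row i of doc_vecs is the clicked document for query i; every other
    row in the batch serves as a negative for that query.
    """
    n = len(query_vecs)
    total = 0.0
    for i in range(n):
        logits = [dot(query_vecs[i], doc_vecs[j]) for j in range(n)]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / n

queries = [[2.0, 0.0], [0.0, 2.0]]
aligned = in_batch_infonce(queries, [[2.0, 0.0], [0.0, 2.0]])
shuffled = in_batch_infonce(queries, [[0.0, 2.0], [2.0, 0.0]])
```

The loss is small when each query scores its own document above the other documents in the batch and large when the pairing is scrambled, which is what pushes clicked pairs together in the embedding space.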

6. Experimental Results and Benchmarks

Evaluation Protocol

  • Zero-shot document retrieval: BEIR benchmark (TREC-COVID, NFCorpus, BioASQ, SciFact, SciDocs), RELISH (article similarity), BIOSSES/MedSTS (sentence similarity).
  • Metrics: nDCG@10, MAP@$k$, Pearson's $r$.
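For reference, nDCG@k can be computed as in this sketch. It assumes the linear-gain variant (gain equal to the relevance grade); some evaluations use an exponential gain instead, and the relevance lists are toy values.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain with linear gain and log2 discount."""
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, all_relevances, k):
    """DCG of the system ranking divided by the DCG of the ideal ranking."""
    ideal_dcg = dcg_at_k(sorted(all_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

judged = [1, 1, 0]       # toy relevance grades of all judged documents
system = [1, 0, 1]       # grades in the order the system returned them
score = ndcg_at_k(system, judged, k=3)
```

The logarithmic discount means a relevant document at rank 3 contributes less than one at rank 2, so the toy ranking above scores below 1.0 even though it returns both relevant documents.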

Performance Summary

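| Task (nDCG@10) | BM25 | cpt-XL | MedCPT |
| --- | --- | --- | --- |
| TREC-COVID | 0.656 | 0.649 | 0.709 |
| NFCorpus | 0.325 | 0.407 | 0.355 |
| BioASQ | 0.465 | – | 0.553 |
| SciFact | 0.665 | 0.754 | 0.761 |
| SciDocs | 0.158 | – | 0.172 |
| Average | 0.454 | – | 0.510 |
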
  • Article similarity: MAP@5 […], nDCG@5 […] (vs. SciNCL: […]).
  • Sentence similarity: BIOSSES […] (0.05 over SciNCL), MedSTS […] (2nd to BioSentVec […]).
  • All tasks were conducted in a zero-shot setting, with no supervised query-document annotations beyond click logs (Jin et al., 2023).

7. Applications, Limitations, and Extensions

Applications

  • Use cases include PubMed Best Match ranking, similar-article recommendation, sentence-level search, and retrieval augmentation for LLMs.

Limitations

  • Dense semantic retrieval systems such as MedCPT provide less interpretability than sparse methods (e.g., BM25) and may yield false semantic matches, particularly with ambiguous gene symbols.
  • Neither MedCPT framework incorporates structured ontologies natively, which limits explainability.

Future Directions

  • Hybrid dense–sparse systems are proposed for improved transparency and balanced recall–precision tradeoffs.
  • Better negative sampling and scaling are ongoing directions.
  • Integration of structured biomedical ontologies may enhance interpretability (Jin et al., 2023).

A plausible implication is that MedCPT architectures—whether for procedural prediction within EHRs or for retrieval from the biomedical literature—provide generalizable, highly performant foundations for learning dense semantic representations at scale using real-world clinical or user interaction data. These methods serve as reproducible blueprints for embedding-based automation in both clinical decision support and large-scale biomedical information access.
