ICD-Bench: ICD Coding Evaluation Framework

Updated 22 July 2025
  • ICD-Bench is an evaluation framework and benchmark suite that standardizes data preprocessing and evaluation protocols to enable reproducible research in automated ICD coding.
  • It leverages high-quality EHR data (e.g., MIMIC-III/IV) to support full-code and top-50 settings, using metrics such as AUC, F1, and MAP for performance assessment.
  • The suite fosters innovation by enabling fair model comparisons and integration of advanced architectures, addressing challenges like label imbalance and rare code prediction.

ICD-Bench is an evaluation framework and suite of benchmark datasets designed to advance research and practice in automated International Classification of Diseases (ICD) coding using electronic health records (EHRs) and clinical text. ICD coding, which assigns medical codes to diagnoses and procedures, is central to healthcare informatics, clinical research, billing, and epidemiological surveillance. ICD-Bench standardizes data preprocessing, model evaluation, and comparison protocols across multiple ICD coding tasks and datasets, thereby facilitating reproducible research and accelerating the development and fair assessment of novel ICD coding methods.

1. Benchmark Construction and Data Standardization

ICD-Bench builds upon high-quality public EHR data, most notably the MIMIC series (MIMIC-III and MIMIC-IV), to provide rigorous testbeds for automated ICD coding (Nguyen et al., 2023). The benchmark suite incorporates both ICD-9 and ICD-10 coding tasks and supports multiple benchmark variants:

  • Full-Code Settings: Include all ICD codes present in the corpus, reflecting the real-world, long-tailed label distribution (e.g., over 26,000 unique ICD-10 codes in MIMIC-IV).
  • Top-50 Settings: Restrict to the 50 most frequent ICD codes, creating a more balanced but less realistic multi-label classification setting (a filtering sketch follows this list).
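As referenced above, the top-50 restriction is a simple frequency filter over the label space. A minimal sketch, assuming a pandas DataFrame df with one row per discharge summary and a hypothetical icd_codes list column (names are illustrative, not the benchmark's released scripts):

```python
from collections import Counter

# df is assumed to hold one row per record, with an "icd_codes" column
# containing the list of ICD codes assigned to that record.
code_counts = Counter(code for codes in df["icd_codes"] for code in codes)
top50 = {code for code, _ in code_counts.most_common(50)}

# Keep only top-50 codes on each record, then drop records left with no labels.
df["icd_codes"] = df["icd_codes"].map(lambda cs: [c for c in cs if c in top50])
df_top50 = df[df["icd_codes"].map(len) > 0].copy()
```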

Data split strategies ensure patient-level independence by partitioning on unique patient or admission identifiers, and the identifier lists for each partition (such as hadm_ids for MIMIC-IV) are published to maximize reproducibility. Text preprocessing protocols (e.g., NLTK-based tokenization, stopword removal, truncation) are explicitly detailed and standardized (Nguyen et al., 2023).
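A patient-level split can be implemented with scikit-learn's group-aware splitters. A minimal sketch, assuming a pandas DataFrame notes with a subject_id column (hypothetical names; this is not the benchmark's released pipeline):

```python
from sklearn.model_selection import GroupShuffleSplit

# First split: hold out a test set so that no subject_id spans partitions.
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
trainval_idx, test_idx = next(gss.split(notes, groups=notes["subject_id"]))
trainval, test = notes.iloc[trainval_idx], notes.iloc[test_idx]

# Second split: carve a validation set out of the remainder, again by patient.
gss_val = GroupShuffleSplit(n_splits=1, test_size=0.125, random_state=42)
train_idx, val_idx = next(gss_val.split(trainval, groups=trainval["subject_id"]))
train, val = trainval.iloc[train_idx], trainval.iloc[val_idx]
```

The standardized NLTK-based text preprocessing can be sketched in the same spirit (the token limit is an illustrative parameter):

```python
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize  # assumes the NLTK "punkt" and
                                         # "stopwords" resources are downloaded

STOP = set(stopwords.words("english"))

def preprocess(text: str, max_tokens: int = 4000) -> list[str]:
    # Lowercase, tokenize, drop stopwords and non-alphanumeric tokens, truncate.
    tokens = [t for t in word_tokenize(text.lower())
              if t.isalnum() and t not in STOP]
    return tokens[:max_tokens]
```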

Label preprocessing is handled with explicit algorithmic rules so that, for instance, the parent code for ICD-9 diagnoses is the first four characters (if the code starts with “E”) or three otherwise; for ICD-10 codes, the parent is defined by the first three characters. These transformations facilitate hierarchical modeling and consistent evaluation.
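These rules translate directly into code. A minimal sketch, assuming undotted code strings as stored in MIMIC:

```python
def parent_code(code: str, icd_version: int) -> str:
    """Map an ICD diagnosis code to its parent category using the rules above."""
    if icd_version == 9:
        # ICD-9 E-codes use a four-character category; all other codes use three.
        return code[:4] if code.startswith("E") else code[:3]
    # ICD-10: the first three characters identify the category.
    return code[:3]

assert parent_code("E8794", 9) == "E879"
assert parent_code("4019", 9) == "401"
assert parent_code("I10", 10) == "I10"
```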

2. Model Evaluation Protocols and Metrics

ICD-Bench standardizes evaluation criteria to ensure fair comparison across a wide variety of modeling approaches, including traditional deep learning architectures, attention-based models, and transformer-based pretrained LLMs (a minimal sketch computing several of these metrics follows the list):

  • Macro and Micro AUC/F1 Scores: Macro variants average per-code performance, weighting rare and frequent codes equally; micro variants aggregate over all code assignments and are therefore weighted by code frequency.
  • Precision@K: Reflects the proportion of correct codes within the top-K predictions per record, aligning with practical use cases where coders check candidate code lists.
  • Mean Average Precision (MAP): Evaluates ranking quality across the set of predictions.
  • Additional Measures: Document-level precision/recall, code-level metrics, and ranking metrics such as NDCG@12 have been adopted in recent ICD-Bench variants, motivated by their clinical relevance for code prioritization and billing workflows (DeYoung et al., 2022).
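As referenced above, here is a minimal sketch of Precision@K, MAP, and micro/macro F1 on binary label matrices, using NumPy and scikit-learn (the array shapes and the 0.5 decision threshold are illustrative assumptions, not the benchmark's exact evaluation scripts):

```python
import numpy as np
from sklearn.metrics import f1_score

def precision_at_k(y_true: np.ndarray, y_score: np.ndarray, k: int) -> float:
    """Mean fraction of correct codes among each record's top-k predictions."""
    topk = np.argsort(-y_score, axis=1)[:, :k]          # indices of top-k codes
    hits = np.take_along_axis(y_true, topk, axis=1)     # 1 where a top-k code is true
    return float(hits.mean())

def mean_average_precision(y_true: np.ndarray, y_score: np.ndarray) -> float:
    """MAP: average precision of each record's ranked code list, then mean."""
    ap = []
    for truth, score in zip(y_true, y_score):
        ranked = truth[np.argsort(-score)]              # truth sorted by score
        hit_positions = np.flatnonzero(ranked)          # 0-based ranks of true codes
        if hit_positions.size == 0:
            continue
        precisions = np.cumsum(ranked)[hit_positions] / (hit_positions + 1)
        ap.append(precisions.mean())
    return float(np.mean(ap))

# Toy example: 2 records x 4 codes.
y_true = np.array([[1, 0, 1, 0], [0, 1, 0, 0]])
y_score = np.array([[0.9, 0.2, 0.4, 0.1], [0.3, 0.8, 0.6, 0.05]])
y_pred = (y_score >= 0.5).astype(int)
print(precision_at_k(y_true, y_score, k=2))             # 0.75
print(mean_average_precision(y_true, y_score))
print(f1_score(y_true, y_pred, average="micro"),
      f1_score(y_true, y_pred, average="macro"))
```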

Tables of baseline results are published for established architectures, enabling researchers to benchmark new algorithms against leading systems under identical experimental conditions.

3. Baseline Models and Architectural Innovations

ICD-Bench encompasses a diverse array of strong baseline and state-of-the-art models, enabling rigorous comparative studies:

Model/Framework | Key Features | Strengths/Considerations
CAML | Convolutional attention | Fast, label-specific focus; a standard for early ICD NLP (Nguyen et al., 2023)
LAAT | Label-specific BiLSTM + attention | Strong on contextual signal; extensible to hierarchical training (Vu et al., 2020)
JointLAAT | Hierarchical joint learning | Improves macro-F1, especially for rare codes
PLM-ICD | Fine-tuned transformer + segment pooling + label attention | Addresses input-length constraints, large label spaces, and domain adaptation; strong empirical results (Huang et al., 2022)
MSMN | Multi-synonym matching | Effective for rare codes via synonym augmentation
Entity-Anchored / PAAT | Local windowed/entity context; partition-based attention | Better generalization to unseen codes and dispersed information (DeYoung et al., 2022; Kim et al., 2022)
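Several of these models share a label-wise attention mechanism, in which every ICD code attends separately over the encoded tokens. A minimal PyTorch sketch in the spirit of CAML/LAAT (not either paper's exact implementation):

```python
import torch
import torch.nn as nn

class LabelAttention(nn.Module):
    """Illustrative label-wise attention: each ICD code owns an attention
    distribution over token representations, yielding one pooled vector
    and one logit per code."""

    def __init__(self, hidden_dim: int, n_codes: int):
        super().__init__()
        self.attn = nn.Linear(hidden_dim, n_codes, bias=False)  # per-code scores
        self.code_vecs = nn.Parameter(torch.empty(n_codes, hidden_dim))
        self.bias = nn.Parameter(torch.zeros(n_codes))
        nn.init.xavier_uniform_(self.code_vecs)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq_len, hidden_dim) from a CNN/BiLSTM/transformer encoder
        alpha = torch.softmax(self.attn(hidden), dim=1)       # (batch, seq_len, n_codes)
        pooled = torch.einsum("bsn,bsh->bnh", alpha, hidden)  # per-code doc vectors
        return (pooled * self.code_vecs).sum(-1) + self.bias  # (batch, n_codes) logits
```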

The benchmark also provides source code for data splitting, preprocessing, and baseline model implementations, fostering widespread adoption and further extension by the research community (Nguyen et al., 2023).

4. Addressing Label Imbalance and Rare Code Prediction

ICD-Bench datasets reflect the naturally long-tailed distribution of ICD codes, where a small subset of codes is very frequent and the majority are rare. This presents major challenges for model learning and evaluation.

Several incorporated models explicitly address this imbalance:

  • Hierarchical Learning: By organizing codes under their parent categories and jointly training on normalized (parent-level) and raw codes, models better exploit shared information and boost recall on infrequent codes (Vu et al., 2020); a joint-loss sketch follows this list.
  • Generalized Zero-Shot/Few-Shot Coding: Adversarial generative modules synthesize pseudo-labeled features for zero-shot codes (those without training examples) using ICD code descriptions, hierarchy, and keyword reconstruction losses to maintain semantic alignment (Song et al., 2019). Integration of these modules into ICD-Bench pipelines broadens the evaluation space to rare and previously unseen codes.
  • Knowledge-Enhanced Models: General knowledge injection frameworks (e.g., GKI-ICD) infuse training with descriptions, synonyms, and hierarchy, improving both overall and rare-code performance without specialized modules (Zhang et al., 24 May 2025).
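To illustrate the hierarchical joint training referenced in the first bullet, here is a minimal two-head loss sketch; the 0.5 weighting and two-head layout are illustrative assumptions rather than the exact JointLAAT formulation (Vu et al., 2020):

```python
import torch
import torch.nn as nn

class JointHierarchicalLoss(nn.Module):
    """Combine BCE over leaf ICD codes with BCE over their parent categories.

    Parent targets can be derived from leaf targets with the parent_code rule
    from Section 1; the model is assumed to emit logits at both levels."""

    def __init__(self, alpha: float = 0.5):
        super().__init__()
        self.alpha = alpha                  # weight on the parent-level loss
        self.bce = nn.BCEWithLogitsLoss()

    def forward(self, leaf_logits, leaf_targets, parent_logits, parent_targets):
        leaf_loss = self.bce(leaf_logits, leaf_targets.float())
        parent_loss = self.bce(parent_logits, parent_targets.float())
        return self.alpha * parent_loss + (1.0 - self.alpha) * leaf_loss
```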

ICD-Bench thus sets a systematic standard for measuring and reporting rare code accuracy, macro-F1, and related metrics.

5. Reproducibility, Accessibility, and Community Impact

A defining feature of ICD-Bench is its commitment to open, reproducible science:

  • Public Release of Data Splits and Processing Pipelines: All splits (e.g., train/val/test hadm_ids) and data filtering steps are provided for researchers with raw data access.
  • Baseline Model Scripts: Recommended hyperparameters, training loops, and evaluation scripts matching the published benchmarks are released.
  • Supporting Expansion: The open nature of code and benchmark documentation enables rapid integration of new models, metrics, and data sources (e.g., non-English EHRs such as RuCCoD for Russian (Nesterov et al., 28 Feb 2025)), facilitating broad adoption.

This reproducibility accelerates algorithmic innovation, enables robust model comparison, and helps clarify which performance gains are genuinely incremental.

6. Implications for Model Development and Clinical Integration

ICD-Bench not only drives machine learning advances but also informs practical system design in healthcare:

  • Consistency with Real-World Workflows: The emphasis on document-level ranking, code hierarchies, and rare code detection mirrors the priorities of clinical coders and billing staff.
  • Generality Across Language and Code Systems: By enabling benchmarks on both ICD-9 and ICD-10 and supporting multilingual development, ICD-Bench is suitable for international adaptation (Nesterov et al., 28 Feb 2025).
  • Guidance for Error Analysis: Studies leveraging ICD-Bench highlight common gaps in coding accuracy—such as undercoding, code confusion systematically introduced by human annotators, and demographic/expert bias. This informs the future design of AI augmentation tools for clinical coders and health informatics policy recommendations (Kim et al., 2021, Zhang et al., 18 Oct 2024, Pan et al., 31 Mar 2025).

7. Future Directions

Emerging research trends suggest multiple future enhancements and applications for ICD-Bench:

  • Multi-Modal Benchmarks: Recent models integrate structured EHR data (e.g., labs, medications) with free-text via attention-based fusion mechanisms, a trend likely to be reflected in future ICD-Bench benchmark tasks (Liu et al., 2023).
  • LLM and Multi-Agent Modeling: LLM-based agent frameworks that mimic real-world multi-role coding workflows (e.g., physician, coder, reviewer, patient) offer new, explainable approaches and may motivate the next generation of benchmark scenarios (Li et al., 1 Apr 2024).
  • Bias and Fairness Audits: Structural causal model-based benchmarks for evaluating and mitigating bias due to demographics and expert labeling multiplicity are being explored and could become standard in ICD-Bench extensions (Zhang et al., 18 Oct 2024).
  • Joint Task and Expanded Labeling: Combining ICD coding with related clinical tasks (e.g., comorbidity extraction, outcome prediction) and evaluating models under joint task settings may further align benchmark design with true clinical utility.

In summary, ICD-Bench provides a comprehensive, reproducible, and evolving framework for the rigorous evaluation of automated ICD coding systems. Through its standardized datasets, evaluation metrics, and integration of leading model architectures, it has become a cornerstone resource for both academic and applied research in clinical NLP and healthcare data science.