Leukemia Bone Marrow Dataset

Updated 27 September 2025

Leukemia bone marrow datasets are collections of cellular, imaging, and clinical data crucial for leukemia detection, classification, and treatment planning.
They integrate modalities such as flow cytometry, microscopy imaging, and clinical parameters to support both traditional statistical analyses and deep learning methods.
These datasets enable cell detection, classification, and prognostic modeling, and they are used to study challenges such as label imbalance and domain shift.

The term “Leukemia Bone Marrow Dataset” covers collections of cellular, morphological, molecular, and imaging data derived from bone marrow samples. Their primary purpose is to support the detection, characterization, and classification of leukemia, and to inform patient stratification and treatment planning. These datasets underpin a range of analytical and diagnostic methods applied to bone marrow aspirates, smears, and flow cytometry measurements, from approaches rooted in statistical mechanics and classical image processing to contemporary deep learning. The following sections summarize foundational principles, technical methodologies, representative datasets, diagnostic applications, and future directions as established in the literature.

1. Key Data Sources and Representations

Leukemia bone marrow datasets typically consist of high-resolution images, flow cytometry measurements, and/or tabular clinical parameters, each supporting different analysis modalities:

Flow Cytometry Data: Individual cell-level measurements in multidimensional marker spaces, including forward and side scatter as well as intensities for a panel of markers and fluorescence channels (e.g., CD45, FL1, FL2, FL4, FL5) (Vilar, 2014, Gachon et al., 24 Jul 2024).
Microscopy Images: Digital whole slide images (WSI) and/or single-cell crops of bone marrow aspirates and smears, often annotated with bounding boxes and cell type labels. Datasets may contain tens of thousands of manually or algorithmically isolated cells, sometimes with fine-grained categorizations into dozens of classes (Höfener et al., 19 Sep 2025).
Clinical and Laboratory Data: Patient-level information including leukemia diagnosis/subtype, age, gender, and laboratory values (e.g., hematocrit, leukocyte counts, lactate dehydrogenase), useful for integrative diagnostic modeling (Höfener et al., 19 Sep 2025).
Morphological Attributes and Annotations: Recent large-scale datasets incorporate detailed cell attribute annotations (e.g., cell size, nuclear shape, chromatin structure) for explainable AI and clinical interpretability (Rehman et al., 3 Apr 2025).
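
To make the relationship between these modalities concrete, the following is a minimal sketch of how one patient's data might be held in memory. The class names, field names, and example values are hypothetical and are not taken from any of the cited datasets, each of which defines its own schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class CellAnnotation:
    """One annotated cell in a smear image (hypothetical schema)."""
    bbox: Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels
    cell_type: str                   # e.g. "myeloblast"
    # Optional morphological attributes, e.g. {"nuclear_shape": "round"}
    attributes: Dict[str, str] = field(default_factory=dict)


@dataclass
class PatientRecord:
    """Patient-level container linking the modalities (hypothetical schema)."""
    patient_id: str
    diagnosis: Optional[str] = None                           # e.g. "AML"
    lab_values: Dict[str, float] = field(default_factory=dict)
    cells: List[CellAnnotation] = field(default_factory=list)
    # One dict of channel -> intensity per flow cytometry event
    flow_events: List[Dict[str, float]] = field(default_factory=list)

    def cell_type_counts(self) -> Dict[str, int]:
        """Count annotated cells per cell type."""
        counts: Dict[str, int] = {}
        for cell in self.cells:
            counts[cell.cell_type] = counts.get(cell.cell_type, 0) + 1
        return counts


# Made-up example values, for illustration only.
record = PatientRecord(patient_id="P001", diagnosis="AML",
                       lab_values={"hematocrit": 0.31})
record.cells.append(
    CellAnnotation((10, 20, 60, 75), "myeloblast", {"nuclear_shape": "round"}))
record.cells.append(CellAnnotation((100, 40, 150, 90), "lymphocyte"))
print(record.cell_type_counts())
```

Keeping the patient identifier on the top-level record is a deliberate choice in this sketch: it makes it straightforward to split data by patient rather than by cell.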

A notable trend across recent works is the development of comprehensive, multi-domain, publicly available datasets, such as the 246-patient dataset with 45,000+ bounding-boxed cells and detailed clinical metadata (Höfener et al., 19 Sep 2025) or the Large Leukemia Dataset (LLD) with detailed per-cell morphological attributes and multi-resolution image acquisition (Rehman et al., 3 Apr 2025).

2. Analytical and Machine Learning Methodologies

A wide methodological spectrum exists for interpreting leukemia bone marrow datasets, including but not limited to:

Statistical Thermodynamics and Information Theory: Entropy-based methods quantify differences in multidimensional marker distributions between normal and pathological states. The population entropy, defined as

$S_i = -\int P_i(x,q) \ln P_i(x,q) \,dx\,dq,$

is used to derive robust similarity metrics and diagnosis probabilities (e.g., via Kullback-Leibler divergence and “maximum entropy reference” distributions) (Vilar, 2014).

Traditional Segmentation and Feature Engineering: Classic image analysis pipelines employ preprocessing (filtering, contrast enhancement), color and morphological transformations, k-means clustering for ROI segmentation, and extraction of descriptors such as color histograms, geometric features (area, perimeter, roundness, solidity), and textural statistics before feeding to classical classifiers (kNN, SVM, Naïve Bayes) (Kumar et al., 2018, Cao et al., 2018).
End-to-End Deep Learning: Contemporary approaches leverage architectures such as ResNet, VGG, DenseNet, and InceptionResNet, often initialized via transfer learning and fine-tuned on bone marrow imaging data. Object detection (Faster R-CNN, CenterNet, YOLO) and fine-grained cell classification are performed directly on digitized images (Höfener et al., 19 Sep 2025, Tayebi et al., 2021, Meem et al., 2023, Chen et al., 14 Jun 2024).
Attribute and Multi-task Prediction: Recent models incorporate multi-task heads for the joint detection of cell type and morphological attributes, often under sparse or weak supervision (sparse-label learning, pseudo-label selection, triplet losses for feature alignment) (Rehman et al., 3 Apr 2025).
Self-Attention and Transformer-Based Models: Architectures such as SCKansformer introduce learnable activation functions (Kolmogorov-Arnold Network) and global-local multi-head self-attention to enhance representation and interpretability in high-dimensional cell classification tasks (Chen et al., 14 Jun 2024).
Topological and Statistical Analysis: Persistent homology is used to capture complex topological signatures of bone marrow structure in AML progression, summarized using phase-dependent Gaussian mixture models for stage differentiation and prediction (Wang et al., 24 Aug 2024). Optimal transport frameworks support dimensionality reduction and visualization of multi-patient flow cytometry, enabling robust minimal residual disease (MRD) monitoring (Gachon et al., 24 Jul 2024).
Metaheuristics and Feature Selection: Genetic algorithms, binary ant colony optimization, and feature selection techniques (ANOVA, Lasso, Random Forest importance) are employed to optimize deep feature subsets for enhanced diagnostic accuracy and computational efficiency (Rahmani et al., 2 Jun 2024, Ratul et al., 2022).
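
As an illustration of the entropy-based comparison described at the start of this list, the sketch below estimates the population entropy and a Kullback-Leibler divergence from binned flow cytometry events. It is a simplified discrete approximation, not the method of Vilar (2014): the binning scheme, the `eps` floor for empty bins, and the function names are assumptions, and the maximum entropy reference construction is not implemented. The discrete sum approximates the integral only up to an additive term that depends on the bin volume.

```python
import math
from collections import Counter


def histogram(events, bin_width):
    """Bin events (tuples of marker values) into a normalized histogram."""
    counts = Counter(tuple(int(v // bin_width) for v in e) for e in events)
    total = sum(counts.values())
    return {b: c / total for b, c in counts.items()}


def entropy(p):
    """Discrete Shannon entropy -sum p ln p, in nats."""
    return -sum(pi * math.log(pi) for pi in p.values() if pi > 0)


def kl_divergence(p, q, eps=1e-9):
    """D_KL(p || q); bins missing from q are floored at eps (an assumption)."""
    return sum(pi * math.log(pi / max(q.get(b, 0.0), eps))
               for b, pi in p.items() if pi > 0)


# Made-up two-marker events, for illustration only.
reference = [(1.2, 3.4), (1.8, 3.1), (2.5, 3.9), (2.1, 4.2)]
sample = [(1.1, 3.3), (5.6, 7.2), (5.9, 7.8), (6.3, 7.1)]

p_ref = histogram(reference, bin_width=1.0)
p_smp = histogram(sample, bin_width=1.0)
print(entropy(p_ref), entropy(p_smp), kl_divergence(p_smp, p_ref))
```

In practice the bin width trades resolution against sampling noise, and the divergence is sensitive to how empty bins are handled, so both choices would need to be validated on real data.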

3. Performance Metrics and Evaluation Protocols

The evaluation of analytic methods on leukemia bone marrow datasets relies on multiple quantitative metrics tailored to the diagnostic task:

Task	Main Metrics	Reported Performance (Source)
Cell Detection	Precision, Recall, F1, AP	AP=0.96 (Faster R-CNN) (Höfener et al., 19 Sep 2025)
Cell Classification	Top-1 Accuracy, Macro-F1, AUROC	Macro AUROC=0.98, F1=0.61 (33-class)
Diagnosis Prediction	Mean F1-score, AUC	Mean F1=0.90 (Höfener et al., 19 Sep 2025)
Segmentation	Segmentation accuracy, mAP	mAP=0.75 (YOLO) (Tayebi et al., 2021)
Mutation Prediction	Accuracy (mutation class)	85% (4-class, with label noise) (Jain et al., 15 Jun 2025)
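
For reference, the macro-averaged F1 score listed above can be sketched as follows. This is a generic textbook definition rather than the evaluation code of any cited paper, and the labels in the example are made up. Because every class contributes equally to the average, poorly recognized rare classes lower the score even when most cells are classified correctly.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    classes = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        denom = 2 * tp + fp + fn
        # A class with no true or predicted instances scores 0 by convention.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Imbalanced toy example: the rare class is never predicted.
y_true = ["blast"] * 8 + ["plasma_cell"] * 2
y_pred = ["blast"] * 10
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                          # 0.8
print(round(macro_f1(y_true, y_pred), 3))  # 0.444
```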

Evaluation protocols typically employ patient- or slide-level splitting, stratified to preserve class distributions; separate training, validation, and test sets; and multi-observer consensus annotation for ground truth reliability (Höfener et al., 19 Sep 2025).
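
A patient-level stratified split of this kind can be sketched as below. The split fractions, the seed, and the function name are assumptions chosen for illustration, and the cited works may use different procedures. The key property is that each patient, and therefore every cell from that patient, lands in exactly one partition.

```python
import random
from collections import defaultdict


def patient_level_split(patient_diagnosis, fractions=(0.7, 0.15), seed=0):
    """Assign whole patients to train/val/test, stratified by diagnosis.

    patient_diagnosis: dict mapping patient_id -> diagnosis label.
    fractions: (train, val) fractions; the remainder goes to test.
    """
    rng = random.Random(seed)
    by_diagnosis = defaultdict(list)
    for pid, dx in sorted(patient_diagnosis.items()):
        by_diagnosis[dx].append(pid)

    split = {}
    for dx, pids in sorted(by_diagnosis.items()):
        rng.shuffle(pids)  # shuffle within each diagnosis group
        n_train = round(fractions[0] * len(pids))
        n_val = round(fractions[1] * len(pids))
        for i, pid in enumerate(pids):
            if i < n_train:
                split[pid] = "train"
            elif i < n_train + n_val:
                split[pid] = "val"
            else:
                split[pid] = "test"
    return split


# Made-up cohort: 20 AML and 20 ALL patients.
cohort = {f"AML_{i:02d}": "AML" for i in range(20)}
cohort.update({f"ALL_{i:02d}": "ALL" for i in range(20)})
assignment = patient_level_split(cohort)

# Cells inherit the partition of their patient.
cells = [("AML_03", "myeloblast"), ("ALL_11", "lymphoblast")]
cell_partitions = [assignment[pid] for pid, _ in cells]
print(cell_partitions)
```

For diagnoses with very few patients, rounding can leave a partition empty, so small groups may need manual handling.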

Datasets with clinical parameters often support the benchmarking of auxiliary models (e.g., gradient boosting on diagnostic features) alongside image-based models (Höfener et al., 19 Sep 2025, Ratul et al., 2022).

4. Explainability, Annotation Strategies, and Clinical Integration

Recent work highlights the importance of annotation granularity, robustness to domain shift, and explainability for clinical acceptance:

Attribute-Level Annotation and Morphology Banks: New paradigms couple automated object detection with prediction of per-cell morphology, enabling text-based summaries that facilitate second-opinion reporting and transparency for clinicians (Rehman et al., 3 Apr 2025).
Sparse and Weak Supervision: The adoption of sparse annotations (labeling only a patch per field of view) and weak supervision (training on slide-level labels, as in Multiple Instance Learning) reduces expert time while leveraging pseudo-labeled or high-confidence predictions for learning (Rehman et al., 3 Apr 2025, Manescu et al., 2022).
Consensus Labeling and Quality Control: For large-scale annotation tasks, majority-vote consensus by multiple experts is prioritized to boost label accuracy and ensure robust model training, though this increases annotation time (Höfener et al., 19 Sep 2025).
Domain Adaptation: Datasets constructed from heterogeneous imaging hardware (high-cost and low-cost microscopes/cameras, varying magnifications) and multiple acquisition centers support the evaluation and development of domain adaptation algorithms, critical for real-world deployment (Rehman et al., 3 Apr 2025).
Explainability and Uncertainty Quantification: Techniques such as deep ensembles for uncertainty quantification and attention-based mechanisms for feature attribution facilitate model interpretability, which is essential in high-stakes diagnostic settings (Akter et al., 18 Oct 2024, Maruf et al., 24 Aug 2025).
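
The consensus labeling step mentioned above can be illustrated with a strict majority vote. This is a generic sketch; the threshold, the handling of ties, and the function name are assumptions, and the cited datasets may resolve disagreements differently (for example, through adjudication by a senior expert).

```python
from collections import Counter


def consensus_label(votes, min_agreement=0.5):
    """Return the majority label, or None when agreement is too low.

    A label is accepted only if strictly more than `min_agreement`
    of the annotators chose it, so ties yield None.
    """
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count / len(votes) > min_agreement else None


print(consensus_label(["blast", "blast", "promyelocyte"]))  # blast
print(consensus_label(["blast", "promyelocyte"]))           # None
```

Cells that return `None` could be excluded from training or routed to additional review, which is one way the trade-off between label quality and annotation time shows up in practice.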

5. Clinical and Biomedical Applications

Leukemia bone marrow datasets serve key roles across diagnostic and prognostic pipelines:

Automated Differential Cell Counts: Machine learning models are reported to match or surpass manual accuracy in generating differential counts from bone marrow smears (approximately 33 cell classes), informing the diagnosis, subtyping, and monitoring of ALL, AML, and CML (Höfener et al., 19 Sep 2025, Tayebi et al., 2021).
Diagnosis Prediction and Outcome Stratification: Patient-level diagnostic prediction models using automated cell counts and clinical laboratory data achieve mean F1 of 0.90, suggesting such pipelines can aid or automate initial diagnosis and triage (Höfener et al., 19 Sep 2025).
Minimal Residual Disease (MRD) Detection: Optimal transport-based dimensionality reduction and clustering of flow cytometry data accurately identify MRD-positive follow-up samples, improving over classical kernel mean embedding methods (Gachon et al., 24 Jul 2024).
Mutation Profile Prediction: Deep learning enables accurate prediction of relevant AML mutations (e.g., NPM1, RUNX1:RUNX1T1, CBFB:MYH11) from single-cell morphology, offering a rapid adjunct to molecular techniques (Jain et al., 15 Jun 2025).
Monitoring and Guidance for Targeted Therapies: Methods supporting CAR-T cell identification (Zhang et al., 2022) and the modeling of chemotherapy response in ALL (Niño-López et al., 2022) demonstrate the applicability of these datasets to therapy selection and personalized medicine.
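
As a simple illustration of the first application, a differential count can be derived from per-cell predictions by converting class counts to percentages. The sketch below assumes a list of predicted cell type names and an optional set of classes to exclude (such as artifacts); both the function name and the class names are hypothetical.

```python
from collections import Counter


def differential_count(predicted_types, exclude=()):
    """Percentage of each cell type among the cells that are kept."""
    kept = [t for t in predicted_types if t not in exclude]
    if not kept:
        return {}
    total = len(kept)
    return {t: 100.0 * n / total for t, n in Counter(kept).items()}


# Made-up predictions for one smear.
predictions = (["myeloblast"] * 30 + ["lymphocyte"] * 50
               + ["erythroblast"] * 20 + ["artifact"] * 5)
print(differential_count(predictions, exclude=("artifact",)))
# {'myeloblast': 30.0, 'lymphocyte': 50.0, 'erythroblast': 20.0}
```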

6. Challenges, Limitations, and Future Directions

Label and Class Imbalance: Rare cell classes remain underrepresented, limiting performance for subtype classification; data augmentation and synthetic image generation (e.g., GANs, diffusion models) are suggested for addressing this challenge (Maruf et al., 24 Aug 2025).
Annotation Burden: Multi-expert consensus labeling, while boosting data quality, is labor intensive; sparse annotation and self-/weak-supervised learning reduce this bottleneck (Rehman et al., 3 Apr 2025).
Domain Shift and Generalizability: Ensuring performance across different imaging modalities and centers necessitates algorithmic robustness to domain shift and thorough validation on external datasets (Rehman et al., 3 Apr 2025).
Model Interpretability: Integration of interpretable attention mechanisms, explicit morphological attributes, and uncertainty quantification is a priority for clinical trust (Akter et al., 18 Oct 2024, Maruf et al., 24 Aug 2025).
Scaling and Workflows: With large-scale, public, well-annotated datasets now available, future directions focus on integration with clinical workflows, cloud-based and point-of-care deployment, and further automation of the entire diagnostic pipeline—from image acquisition to risk stratification (Höfener et al., 19 Sep 2025, Maruf et al., 24 Aug 2025).

7. Representative Datasets and Availability

Dataset Name / Paper	Key Features	Availability
Comprehensive Pediatric Dataset	246 patients, >45,000 cells, 33 classes	Public (Höfener et al., 19 Sep 2025)
Large Leukemia Dataset (LLD)	28.9K images, full/sparse annotation, 14 types	Public (Rehman et al., 3 Apr 2025)
BMCD-FGCD	92,335 images, ~40 types, H&E stained	Public (Chen et al., 14 Jun 2024)
BMEC	5,666 erythroid cells	Public (Wang et al., 2022)
C-NMC 2019	15,114 BMA images, ALL challenge	Public (Rahmani et al., 2 Jun 2024)
Bordeaux FCM, HIPC	Flow cytometry, AML diagnosis & MRD	Mixed, see paper (Gachon et al., 24 Jul 2024)

These datasets and their associated code are referenced in the corresponding works. Most are listed as publicly available for further research and development, and together they provide a foundation for the next generation of AI-enabled leukemia diagnostics and research.