AI Fracture Detection Systems
- AI-based fracture detection systems are computational pipelines that use deep learning and machine learning to identify, localize, and grade fractures across a range of imaging modalities.
- They integrate diverse preprocessing techniques and architectures such as CNNs, transformers, and ensemble detectors to achieve high accuracy and real-time performance in clinical and industrial settings.
- Future research is focused on addressing data imbalance, improving ordinal grading and interpretability, and validating models through multi-center studies for broader clinical adoption.
AI-based fracture detection systems are computational pipelines employing artificial intelligence, primarily deep learning and machine learning, for the identification, localization, grading, and characterization of bone fractures in radiographic or tomographic images. They incorporate a range of architectures and training paradigms, from lightweight convolutional neural networks (CNNs) and transformer-based classifiers to ensemble object detectors and domain-specific representation learning. These systems are now integral to research in automated musculoskeletal and trauma radiology, quality control in industrial inspection, and clinical decision support for fracture triage.
1. Imaging Modalities, Datasets, and Preprocessing
AI-based fracture detection predominantly targets plain radiographs (X-ray), computed tomography (CT), and digital images of manufactured components. Most medical imaging systems utilize established public datasets, such as GRAZPEDWRI-DX (pediatric wrist X-rays, over 20,000 images) (Ferdi, 31 Dec 2024, Chien et al., 17 Mar 2024, Ju et al., 2023, Ahmed et al., 17 Jul 2024, Sun et al., 27 Sep 2025), VerSe (vertebral CTs, N=1,283) (Husseini et al., 2020), and FracAtlas (multi-region musculoskeletal X-rays, N=4,083) (Hassan et al., 7 Sep 2025). Industrial systems typically assemble production-line images using high-resolution area scan cameras (Shetty, 2019).
Preprocessing pipelines vary according to image modality and target:
- Standardized cropping and rescaling (e.g., 224×224 to 1024×1024) for anatomical focus.
- Histogram equalization, CLAHE, gamma correction, and denoising to maximize edge clarity (Haque et al., 31 Jul 2025, Hardalaç et al., 2021, Yang et al., 2019).
- Edge and contour extraction (Canny, Sobel, or Laplacian filters; Hough or contour transforms) for classical pipelines (Yang et al., 2019, Yang et al., 2019, Shetty, 2019).
- Data augmentation (geometric, photometric, MixUp, mosaic) at training time to increase effective sample size and reduce data imbalance (Chien et al., 17 Mar 2024, Ferdi, 31 Dec 2024, Ju et al., 2023, Ahmed et al., 17 Jul 2024, Sun et al., 27 Sep 2025).
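As a concrete illustration of the contrast-enhancement step, the following is a minimal pure-Python sketch of global histogram equalization on an 8-bit grayscale image; CLAHE, as used in the cited pipelines, adds tiling and a clip limit on top of this idea. The function name and list-of-rows image format are illustrative choices, not taken from any cited system:

```python
def equalize_histogram(image, levels=256):
    """Global histogram equalization for an 8-bit grayscale image
    given as a list of rows of integer pixel values in [0, levels)."""
    # Intensity histogram over all pixels.
    hist = [0] * levels
    for row in image:
        for px in row:
            hist[px] += 1
    total = sum(hist)
    # Cumulative distribution, then remap each level so output
    # intensities spread uniformly over [0, levels - 1].
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)
    cdf_min = next(c for c in cdf if c > 0)
    if total == cdf_min:  # constant image: nothing to equalize
        return [row[:] for row in image]

    def remap(px):
        return round((cdf[px] - cdf_min) / (total - cdf_min) * (levels - 1))

    return [[remap(px) for px in row] for row in image]
```

Applied to a low-contrast crop, this stretches a narrow band of intensities (e.g., pixel values 100–102) across the full dynamic range, which is why such steps sharpen cortical edges before detection.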
2. Model Architectures and Learning Paradigms
The systems implement a variety of architectures, matched to the detection objective:
- End-to-end CNN classifiers: Both custom shallow CNNs for lightweight deployments (Hassan et al., 7 Sep 2025) and transfer-learning from large-scale models such as EfficientNet-B4 (Sato et al., 2020), DenseNet-169 (Krogue et al., 2019), VGG-19 (Haque et al., 31 Jul 2025), and Inception V3 (Shetty, 2019) are common. These typically output binary (fracture/no fracture) or multiclass (fracture subtypes) labels.
- Single-stage object detectors: Anchor-based or anchor-free YOLO variants (YOLOv5/v6/v7/v8/v9/v11, YOLOX) (Chien et al., 17 Mar 2024, Ju et al., 2023, Ferdi, 31 Dec 2024, Ahmed et al., 17 Jul 2024, Sun et al., 27 Sep 2025) and RetinaNet derivatives (Hardalaç et al., 2021, Krogue et al., 2019) provide bounding-box localization with per-object fracture classification.
- Two-stage detectors: Region proposal–based methods (Faster R-CNN, Dynamic R-CNN, SABL) generate candidate regions and refine localization/classification in a decoupled post-processing head (Hardalaç et al., 2021, M et al., 17 Jul 2025, Krogue et al., 2019).
- Metric learning/ordinal representation: For tasks with ordinal grading (e.g., vertebral fractures by Genant scale), representation learning pipelines employ quadruplet or triplet metric loss to embed explicit severity constraints (Husseini et al., 2020).
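The ordinal-metric-learning idea can be illustrated with a simplified quadruplet hinge loss. This is a schematic sketch of the general mechanism (embedding distances forced to grow with the grade gap), not the exact grading loss of (Husseini et al., 2020); function names, margins, and the quadruplet layout are illustrative:

```python
import math

def l2(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def grading_quadruplet_loss(anchor, same, near, far, m1=0.5, m2=0.5):
    """Hinge loss on a quadruplet (anchor, same-grade sample,
    one-grade-off sample, two-grades-off sample). Same-grade pairs
    should be closer than one-grade-off pairs, which should in turn
    be closer than two-grades-off pairs, each by a margin."""
    d_same = l2(anchor, same)
    d_near = l2(anchor, near)
    d_far = l2(anchor, far)
    return max(0.0, d_same - d_near + m1) + max(0.0, d_near - d_far + m2)
```

When distances already respect the ordinal ordering by the required margins, the loss is zero; otherwise the violating pair contributes linearly, which is what produces the separable, ordered clusters seen in the t-SNE visualizations cited later.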
Signal-processing-derived feature schemes are also used, notably:
- Line/contour feature extraction followed by ANN classification: standard and adaptive differential parameter optimization (ADPO) for line detection (Yang et al., 2019), and contour histogram features (CHFB) (Yang et al., 2019).
- Hybrid/ensemble architectures, combining multiple detectors (e.g., Faster R-CNN, EfficientDet, RF-DETR) with post-hoc fusion (Soft-NMS, weighted box fusion (WBF), non-maximum weighted (NMW) fusion) for performance maximization (M et al., 17 Jul 2025, Hardalaç et al., 2021).
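Box-fusion post-processing can be sketched compactly. The following is a simplified greedy variant of weighted box fusion (WBF) over detections pooled from several models; the function names and the cluster-representative heuristic are chosen for brevity here, not taken from the cited implementations:

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def fuse(cluster):
    """Confidence-weighted average box of a cluster of (box, score)
    pairs; the fused score is the mean member score."""
    total = sum(s for _, s in cluster)
    box = tuple(sum(b[i] * s for b, s in cluster) / total for i in range(4))
    return box, total / len(cluster)

def weighted_box_fusion(detections, iou_thr=0.55):
    """Greedily cluster pooled detections by IoU against each
    cluster's current fused box, then emit one fused box per cluster."""
    clusters = []
    for box, score in sorted(detections, key=lambda d: -d[1]):
        for cluster in clusters:
            if iou(fuse(cluster)[0], box) >= iou_thr:
                cluster.append((box, score))
                break
        else:
            clusters.append([(box, score)])
    return [fuse(c) for c in clusters]
```

Unlike NMS, which discards all but the top-scoring box in a cluster, this averages the member boxes, so localization benefits from every model's prediction rather than only the most confident one.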
3. Loss Functions, Optimization, and Training Protocols
AI-based fracture detection models optimize variations of cross-entropy and regression losses:
- Standard binary/multiclass cross-entropy for classifier heads (Haque et al., 31 Jul 2025, Sato et al., 2020, Shetty, 2019).
- Focal loss and smooth L1 (Huber) loss for object detection tasks (Hardalaç et al., 2021, M et al., 17 Jul 2025, Krogue et al., 2019, Chien et al., 17 Mar 2024).
- CIoU/DIoU loss between predicted and ground-truth bounding boxes (Chien et al., 17 Mar 2024, Ahmed et al., 17 Jul 2024, Ju et al., 2023, Ferdi, 31 Dec 2024).
- Metric/ordinal loss: For Genant-graded vertebral fractures, grading loss imposes ordinal separation via margin-based hinge loss on quadruplets, augmenting standard contrastive or triplet losses (Husseini et al., 2020).
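The two detection losses cited most often above can be written compactly. This is a textbook-form sketch of the standard focal and smooth-L1 definitions, not code from any cited system:

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for one prediction: p is the predicted
    foreground probability, y the 0/1 label; the (1 - p_t)^gamma
    factor down-weights easy, well-classified examples."""
    p_t = p if y == 1 else 1.0 - p
    a_t = alpha if y == 1 else 1.0 - alpha
    return -a_t * (1.0 - p_t) ** gamma * math.log(p_t)

def smooth_l1(x, beta=1.0):
    """Smooth L1 (Huber) loss on a box-regression residual x:
    quadratic near zero, linear in the tails."""
    return 0.5 * x * x / beta if abs(x) < beta else abs(x) - 0.5 * beta
```

The focal modulation matters for fracture detection precisely because of the class imbalance discussed later: the vast majority of anchors or pixels are easy negatives, and focal loss keeps them from dominating the gradient.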
Optimization is typically by Adam or SGD, with learning-rate decay (cosine/step-wise), early stopping, and augmentations. For example, (Ferdi, 31 Dec 2024) follows a one-cycle learning rate schedule; (Sato et al., 2020, Haque et al., 31 Jul 2025) use the Adam optimizer with warmup and patience-based early stopping; mixup and test-time augmentation are frequently included for robust generalization (Raisuddin et al., 2020).
4. Evaluation Metrics and Comparative Performance
Performance is quantified by both classification and detection metrics:
- Accuracy, precision, recall (sensitivity), specificity, F1-score, AUC-ROC (Sato et al., 2020, Haque et al., 31 Jul 2025, Gale et al., 2017, Ahmed et al., 17 Jul 2024, Hassan et al., 7 Sep 2025).
- Average precision (AP) and mean average precision (mAP@0.5, mAP@0.5:0.95) for object detection tasks, computed via COCO-style or VOC-style protocols (Hardalaç et al., 2021, Ju et al., 2023, Chien et al., 17 Mar 2024, M et al., 17 Jul 2025, Ferdi, 31 Dec 2024, Sun et al., 27 Sep 2025).
Representative performance values (test set):

| System | Modality | mAP@0.5 | Sensitivity | F1 | AUC | Reference |
|-----------------------|-------------|---------|-------------|-------|-------|--------------|
| G-YOLOv11 (large) | X-ray/Wrist | 0.535 | — | — | — | (Ferdi, 31 Dec 2024) |
| YOLOv9-E (1024 px) | X-ray/Wrist | 0.657 | — | 0.66 | — | (Chien et al., 17 Mar 2024) |
| Fracture-YOLO | X-ray/Wrist | 0.653 | — | — | — | (Sun et al., 27 Sep 2025) |
| DeepWrist | X-ray/Wrist | — | — | — | 0.84* | (Raisuddin et al., 2020) |
| DenseNet-169, Krogue | X-ray/Hip | — | 0.927 | 0.938 | 0.973 | (Krogue et al., 2019) |
| EfficientNet-B4, Sato | X-ray/Hip | — | 0.952 | 0.961 | 0.99 | (Sato et al., 2020) |
| Custom CNN, FracAtlas | X-ray/Multi | — | 0.88 | 0.91 | — | (Hassan et al., 7 Sep 2025) |
| CHFB Contour-ANN | X-ray/Long | — | — | — | 0.83 | (Yang et al., 2019) |
* DeepWrist AUC drops to 0.84 for CT-confirmed subtle cases; on routine cases, AUC reaches 0.99 (Raisuddin et al., 2020).
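The mAP figures above reduce to per-class average precision at a fixed IoU threshold. A minimal sketch of that computation (greedy score-ordered matching plus VOC-style all-point interpolation, with illustrative names) is:

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def average_precision(dets, gts, iou_thr=0.5):
    """AP at one IoU threshold for one class: dets are (box, score)
    pairs, gts are ground-truth boxes. Detections are matched
    greedily in score order; AP is the area under the interpolated
    precision-recall curve."""
    matched, tps = set(), []
    for box, _ in sorted(dets, key=lambda d: -d[1]):
        best, best_iou = None, iou_thr
        for i, gt in enumerate(gts):
            if i not in matched and iou(box, gt) >= best_iou:
                best, best_iou = i, iou(box, gt)
        if best is None:
            tps.append(0)        # false positive
        else:
            matched.add(best)
            tps.append(1)        # true positive
    # Precision/recall after each detection in score order.
    points, tp_cum = [], 0
    for k, tp in enumerate(tps, start=1):
        tp_cum += tp
        points.append((tp_cum / len(gts), tp_cum / k))
    # Interpolate precision from the right, then integrate over recall.
    interp, max_p = [], 0.0
    for r, p in reversed(points):
        max_p = max(max_p, p)
        interp.append((r, max_p))
    interp.reverse()
    ap, prev_r = 0.0, 0.0
    for r, p in interp:
        ap += (r - prev_r) * p
        prev_r = r
    return ap
```

mAP@0.5 averages this quantity over classes; the COCO-style mAP@0.5:0.95 additionally averages it over IoU thresholds from 0.5 to 0.95 in steps of 0.05, which is why those values run markedly lower.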
Human-level or superior performance is claimed in several studies: DenseNet-169 achieves parity or exceeds expert and resident readers in hip fracture detection (Krogue et al., 2019); EfficientNet-B4 approaches subspecialist-level sensitivity (Sato et al., 2020); ensemble methods (WFD_C, NMW) can yield F1 ≈ 0.96 (M et al., 17 Jul 2025, Hardalaç et al., 2021). Performance on rarely represented classes remains suboptimal across all models (Chien et al., 17 Mar 2024, Ju et al., 2023, Ferdi, 31 Dec 2024).
5. Interpretability, Workflow Integration, and Clinical Utility
AI-based systems increasingly provide model interpretability for trust and regulatory purposes. Grad-CAM heatmaps and related saliency-map techniques permit ROI-level validation by clinicians (Haque et al., 31 Jul 2025, Sato et al., 2020, Krogue et al., 2019, Raisuddin et al., 2020). Explicit geometric keypoint predictions for vertebral fractures yield directly verifiable measures (anterior/middle/posterior heights) in line with clinical standards (Pisov et al., 2020). Embedding-space visualizations (t-SNE) empirically demonstrate the effect of specialized loss functions (e.g., grading loss yields separable, ordinal clusters) (Husseini et al., 2020).
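The Grad-CAM maps mentioned here follow a simple recipe: weight each convolutional feature map by the spatial average of the class-score gradient with respect to it, sum the weighted maps, and apply a ReLU. A library-free sketch over plain nested lists (an illustrative interface, not any cited implementation) looks like this:

```python
def grad_cam(activations, gradients):
    """Grad-CAM heatmap from one conv layer. activations and
    gradients are K feature maps (lists of rows) of equal shape.
    Channel weights are globally averaged gradients; the map is
    the ReLU of the weighted sum of activation maps."""
    h, w = len(activations[0]), len(activations[0][0])
    # alpha_k: global-average-pooled gradient per channel.
    weights = [sum(sum(row) for row in g) / (h * w) for g in gradients]
    cam = [[0.0] * w for _ in range(h)]
    for a_k, w_k in zip(activations, weights):
        for i in range(h):
            for j in range(w):
                cam[i][j] += w_k * a_k[i][j]
    # ReLU keeps only regions that positively support the class score.
    return [[max(0.0, v) for v in row] for row in cam]
```

Upsampled to the input resolution and overlaid on the radiograph, the resulting map lets a clinician check that the model's evidence coincides with the fracture line rather than with unrelated anatomy or markers.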
Deployment and workflow integration are addressed by several systems:
- Real-time web and desktop apps for surgeon/radiologist use (<0.5 s per-image latency) (Haque et al., 31 Jul 2025, Ju et al., 2023, Ferdi, 31 Dec 2024).
- Embedded inference on GPUs or edge devices, enabling point-of-care triage without specialist access (Ferdi, 31 Dec 2024, Hassan et al., 7 Sep 2025).
- PACS (Picture Archiving and Communication Systems) integration for automated triage or reporting (Krogue et al., 2019, Sato et al., 2020, Nejad et al., 2023).
- Regulatory/validation requirements are highlighted, particularly for medical deployment (Ferdi, 31 Dec 2024, Ju et al., 2023, M et al., 17 Jul 2025).
6. Limitations, Open Challenges, and Future Directions
Current systems encounter critical limitations:
- Data imbalance and limited annotation scope: Under-representation of rare subtypes and classes impairs generalization and recall, especially for subtle or multi-class fracture scenarios (Chien et al., 17 Mar 2024, Ferdi, 31 Dec 2024, Sun et al., 27 Sep 2025).
- Dataset size and external validation: Most studies are single-center, with few cross-site evaluations, threatening external generalizability (Krogue et al., 2019, Raisuddin et al., 2020, Ju et al., 2023).
- Grading/ordinal assessment: Many systems perform binary classification only, overlooking clinically important gradations (e.g., Genant scale for vertebral fractures). Exceptions include explicit ordinal loss pipelines (Husseini et al., 2020, Pisov et al., 2020).
- Interpretability: While Grad-CAM and similar tools are increasingly common, many high-performing detectors lack transparent, clinico-anatomical rationale for predictions (Haque et al., 31 Jul 2025, Shetty, 2019).
- Occult/subtle fracture detection: Models perform robustly on “routine” cases but show large drops in AP/AUC for CT-only confirmed fractures, with poor OOD/noise-uncertainty quantification (Raisuddin et al., 2020).
Active research trajectories include:
- Ordinal/graded loss functions and meta-learning for rapid domain adaptation (Husseini et al., 2020).
- Attention mechanisms to enhance focus on small or critical fracture patterns (e.g., CRSelector, Scale-Aware heads) (Sun et al., 27 Sep 2025).
- Ensemble fusion across architectural and detector types, balancing recall and localization accuracy (M et al., 17 Jul 2025, Hardalaç et al., 2021).
- 3D/temporal extension (to multi-view, multi-modality, or longitudinal CT/X-ray) (Pisov et al., 2020, Sun et al., 27 Sep 2025).
- Prospective, multi-center clinical trials and regulatory documentation for real-world deployment (M et al., 17 Jul 2025, Ju et al., 2023, Ferdi, 31 Dec 2024).
7. Significance and Outlook
AI-based fracture detection systems represent a convergence of deep learning innovation, clinical need for rapid triage, and interpretability requirements for regulatory and practical acceptance. While substantial progress has been achieved in classification accuracy, runtime performance, and integration with PACS and clinical workflows, ongoing areas of research include rare-case generalization, explicit handling of ordinal/severity information, explainability for clinical end-users, and robust validation across diverse imaging scenarios. High-accuracy, low-latency lightweight detectors are now practical for real-world pediatric and adult fracture screening; however, widespread adoption will depend on future work in external cross-site validation, uncertainty quantification, and regulatory-compliant deployment (Ferdi, 31 Dec 2024, Sun et al., 27 Sep 2025, Ju et al., 2023, Krogue et al., 2019, Husseini et al., 2020).