- The paper introduces Knee-xRAI, a framework that automates KL grading by explicitly quantifying JSN, osteophytes, and sclerosis.
- It employs a dual-path system combining XGBoost for auditable features and a ConvNeXt hybrid model for enhanced classification performance.
- The study demonstrates that integrating explicit feature extraction with deep learning can yield competitive metrics while enhancing clinical transparency.
Knee-xRAI: An Explainable AI Framework for Automatic Kellgren–Lawrence Grading of Knee Osteoarthritis
Introduction and Motivation
Knee osteoarthritis (KOA), a high-prevalence degenerative condition, is primarily evaluated radiographically using the Kellgren–Lawrence (KL) grading system—a five-grade ordinal measure defined by joint space narrowing (JSN), osteophyte formation, and subchondral sclerosis. The literature documents substantial inter-reader variability and ambiguity across KL grade definitions, with direct therapeutic and triage consequences. While previous machine learning approaches have demonstrated strong predictive performance on KL grade classification, they typically do so by compressing clinical features into opaque, end-to-end models with limited transparency for feature-specific audit or clinical review.
Knee-xRAI addresses this interpretability deficit by explicitly quantifying the cardinal radiographic findings, using modular deep learning and hybrid techniques, and integrating them into a dual-path classification system for KL grade prediction. The framework aims to deliver competitive classification metrics while providing direct, auditable evidence for each structural feature, facilitating deployment in radiologist-shortage environments and resource-constrained healthcare settings.
Methods and Framework Architecture
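Knee-xRAI is structured as a four-stage pipeline: JSN segmentation and quantification, osteophyte grading, sclerosis classification, and dual-path KL grade classification.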
Figure 1: A schematic overview of the Knee-xRAI pipeline, with explicit quantification of JSN, osteophytes, and sclerosis feeding feature-based and image-based KL grade classifiers.
JSN Segmentation and Quantification
JSN segmentation is performed with a U-Net++ architecture employing an EfficientNet-B4 encoder, trained on manually annotated contours to prioritize minimum joint space width (mJSW) measurement accuracy. The resulting JSN sub-vector contains compartmental mJSW, JSN rates, and asymmetry indices for both the medial and lateral joint spaces.
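As a rough illustration of how a compartmental mJSW and an asymmetry index could be derived from a binary joint-space mask, the sketch below measures the vertical extent of the mask in each image column and takes the minimum per compartment. The function names, the column-wise measurement, the column ranges used for the compartments, and the asymmetry formula are all assumptions made for illustration; the source does not specify them. Widths are in pixels, since no physical calibration is assumed.

```python
def column_widths(mask, col_start, col_end):
    """Vertical joint-space extent (pixels) per column in [col_start, col_end)."""
    widths = []
    for c in range(col_start, col_end):
        rows = [r for r, row in enumerate(mask) if row[c]]
        if rows:  # skip columns with no joint-space pixels
            widths.append(max(rows) - min(rows) + 1)
    return widths


def min_jsw(mask, col_start, col_end):
    """Minimum joint space width (pixels) within one compartment."""
    widths = column_widths(mask, col_start, col_end)
    return min(widths) if widths else 0


def asymmetry_index(medial, lateral):
    """Hypothetical normalized difference between compartment widths."""
    total = medial + lateral
    return 0.0 if total == 0 else (medial - lateral) / total


# Toy 5x6 mask: columns 0-2 stand in for one compartment, 3-5 for the other.
mask = [
    [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 1, 1],
    [1, 1, 1, 1, 1, 1],
    [1, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 0, 0],
]
medial_mjsw = min_jsw(mask, 0, 3)    # narrowest column is 1 pixel tall
lateral_mjsw = min_jsw(mask, 3, 6)   # every column is 3 pixels tall
asym = asymmetry_index(medial_mjsw, lateral_mjsw)
```

Which columns correspond to the medial or lateral compartment depends on knee laterality and landmark detection, which this sketch does not model.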
Figure 2: JSN module outputs with feature overlays for KL grades 0, 2, and 4, demonstrating progression in joint space narrowing.
Osteophyte Grading
Osteophyte quantification utilizes an SE-ResNet-50 network, multitasked across four anatomical sites per the OARSI atlas. Site patch extraction is derived from JSN segmentation landmarks, supported by fallback anatomical cropping for robustness. The osteophyte feature vector encodes site-specific grades, total burden, and composite metrics.
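The osteophyte sub-vector described above can be pictured as a small dictionary built from per-site grades. The following is a minimal sketch, assuming the four sites are the medial and lateral femur and tibia with OARSI-style grades 0 to 3; the key names and the particular composite metrics (maximum grade, medial and lateral burden) are hypothetical choices, not taken from the source.

```python
SITES = ("medial_femur", "lateral_femur", "medial_tibia", "lateral_tibia")


def osteophyte_features(grades):
    """Build an illustrative osteophyte feature dict from per-site grades."""
    values = [grades[site] for site in SITES]
    if any(not 0 <= g <= 3 for g in values):
        raise ValueError("grades are assumed to lie in 0..3")
    features = {f"grade_{site}": grades[site] for site in SITES}
    features["total_burden"] = sum(values)  # sum of all site grades
    features["max_grade"] = max(values)     # worst single site
    features["medial_burden"] = grades["medial_femur"] + grades["medial_tibia"]
    features["lateral_burden"] = grades["lateral_femur"] + grades["lateral_tibia"]
    return features


example = osteophyte_features(
    {"medial_femur": 2, "lateral_femur": 0, "medial_tibia": 1, "lateral_tibia": 0}
)
```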
Sclerosis Classification
Subchondral sclerosis is assessed via hybrid texture-CNN modeling, applying handcrafted statistical descriptors alongside EfficientNet-B0 for binary classification (presence vs. absence). The module adapts the ROI extraction strategy to the tibial plateau for anatomical fidelity.
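To make the "handcrafted statistical descriptors" concrete, here is a small sketch of first-order texture statistics computed on an ROI patch and concatenated with a CNN embedding. The specific descriptors (mean, variance, skewness, histogram entropy), the function name, and the placeholder embedding are assumptions; the source does not list which descriptors the module uses.

```python
import math
from collections import Counter


def texture_descriptors(patch):
    """First-order intensity statistics of a 2D patch (list of rows)."""
    pixels = [p for row in patch for p in row]
    n = len(pixels)
    mean = sum(pixels) / n
    var = sum((p - mean) ** 2 for p in pixels) / n
    std = math.sqrt(var)
    # Skewness is undefined for a constant patch; report 0.0 in that case.
    skew = 0.0 if std == 0 else sum((p - mean) ** 3 for p in pixels) / (n * std ** 3)
    counts = Counter(pixels)
    entropy = sum(-(c / n) * math.log2(c / n) for c in counts.values())
    return [mean, var, skew, entropy]


patch = [[0, 0], [255, 255]]             # toy ROI with two intensity levels
handcrafted = texture_descriptors(patch)
cnn_embedding = [0.1, -0.3, 0.7]         # placeholder for a learned embedding
fused = handcrafted + cnn_embedding      # simple concatenation before the classifier
```

Concatenation is only one possible fusion strategy; the source states that the two feature types are combined but not how.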
Figure 3: ROI extraction and sclerosis classification outcomes for representative patients, visualizing the subchondral region and classifier decision.
Dual KL Grade Classification Paths
Feature aggregation produces a 50-dimensional structured vector spanning all measured radiographic findings. Two classification paths are deployed:
- Path A: An XGBoost classifier for fully auditable, feature-level attribution via SHAP.
- Path B: A ConvNeXt-Small hybrid model fusing the structured vector with a global image encoding for optimized performance.
A Gradio-based interface provides interactive feature overlays, spatial artifacts, and class probability explanations, supporting clinical decision-making.
Experimental Results
The pipeline was evaluated on a large-scale, OAI-derived dataset (8,260 radiographs) partitioned into stratified train/validation/test splits. Feature-level annotations (400–500 per module) guided training and module evaluation.
JSN segmentation yielded Dice = 0.8909 and ICC = 0.8674 for mJSW relative to manual labels, indicating good agreement with manual measurement. Osteophyte grading showed heterogeneous performance across sites, with medial femur κ = 0.5828 and lateral femur substantially lower (κ = 0.1048), attributed to annotation volume and anatomical projection ambiguity. The sclerosis classifier, optimized for macro F1, achieved test macro F1 = 0.5785 and AUC = 0.6114, with performance constrained by annotation scale.
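For readers unfamiliar with the Dice score reported for the segmentation module, the following minimal sketch computes it for two binary masks as twice the overlap divided by the total foreground size. The function name and the convention of returning 1.0 for two empty masks are assumptions for illustration.

```python
def dice_coefficient(pred, target):
    """Dice = 2 * |pred AND target| / (|pred| + |target|) for binary 2D masks."""
    intersection = sum(
        1
        for pred_row, target_row in zip(pred, target)
        for p, t in zip(pred_row, target_row)
        if p and t
    )
    pred_size = sum(1 for row in pred for p in row if p)
    target_size = sum(1 for row in target for t in row if t)
    if pred_size + target_size == 0:
        return 1.0  # both masks empty: treated as perfect agreement
    return 2 * intersection / (pred_size + target_size)


predicted = [[1, 1], [0, 0]]
manual = [[1, 0], [0, 0]]
score = dice_coefficient(predicted, manual)  # 2 * 1 / (2 + 1)
```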
KL grade classification results are as follows:
- Path A (XGBoost): QWK = 0.6294, accuracy = 0.5399, macro F1 = 0.5238, AUC = 0.8046.
- Path B (ConvNeXt Hybrid): higher than Path A (QWK, accuracy, macro F1, and AUC values not recoverable from the source text).
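QWK (quadratic weighted kappa) is the headline metric here because KL grades are ordinal: a prediction two grades off is penalized more than one grade off. The sketch below is a plain-Python version of the standard definition, assuming integer grades 0 to 4; it is meant as an illustration of the metric, not as the evaluation code used in the study.

```python
def quadratic_weighted_kappa(y_true, y_pred, n_classes=5):
    """Cohen's kappa with quadratic disagreement weights for ordinal labels."""
    n = len(y_true)
    observed = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(n_classes)) for j in range(n_classes)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            weight = (i - j) ** 2 / (n_classes - 1) ** 2
            expected = hist_true[i] * hist_pred[j] / n  # chance agreement
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected
    if weighted_expected == 0:
        return 1.0  # all labels and predictions fall in a single class
    return 1.0 - weighted_observed / weighted_expected


perfect = quadratic_weighted_kappa([0, 1, 2, 3, 4], [0, 1, 2, 3, 4])
reversed_grades = quadratic_weighted_kappa([0, 1, 2, 3, 4], [4, 3, 2, 1, 0])
```

Perfect agreement gives a kappa of 1, while the fully reversed grading in the second call gives a value of about -1.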
Ablation studies (Path A) confirmed JSN features as the dominant predictor when used alone, with osteophyte features yielding incremental QWK gains and sclerosis features contributing marginally under the current labeling schema (the corresponding QWK values are not recoverable from the source text).
Figure 4: (Left) Feature-family ablation chart for QWK; (Right) global SHAP feature attribution across the test set, highlighting JSN dominance.
Structured path ablation (Path B) showed substantial QWK drops when the structured vector was zeroed or permuted at inference time (the exact drops are not recoverable from the source text). This indicates that the ConvNeXt hybrid relies on the alignment between the explicit features and the image, rather than treating the structured vector as ignorable side input.
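The zeroing and permutation ablations can be sketched as two transformations applied to the structured input at inference time, followed by re-scoring. In this illustration a batch is a list of (image, structured_vector) pairs; permutation is assumed to shuffle structured vectors across samples so they no longer match their images, which the source does not spell out. The toy model and the use of accuracy instead of QWK are hypothetical simplifications.

```python
import random


def zero_structured(batch):
    """Replace every structured vector with zeros, keeping images unchanged."""
    return [(image, [0.0] * len(vector)) for image, vector in batch]


def permute_structured(batch, seed=0):
    """Shuffle structured vectors across samples, breaking image-feature pairing."""
    rng = random.Random(seed)
    vectors = [vector for _, vector in batch]
    rng.shuffle(vectors)
    return [(image, vector) for (image, _), vector in zip(batch, vectors)]


def accuracy(model, batch, labels):
    """Fraction of samples where the model output equals the label."""
    hits = sum(model(image, vector) == y for (image, vector), y in zip(batch, labels))
    return hits / len(labels)


def toy_model(image, vector):
    """Hypothetical stand-in that reads the grade from the first feature."""
    return int(vector[0])


batch = [("a", [0.0]), ("b", [1.0]), ("c", [2.0]), ("d", [3.0])]
labels = [0, 1, 2, 3]
baseline = accuracy(toy_model, batch, labels)                  # intact inputs
zeroed = accuracy(toy_model, zero_structured(batch), labels)   # only grade 0 survives
permuted_batch = permute_structured(batch)
```

A large drop under both transformations, as reported for Path B, suggests the model uses the structured vector in combination with the matching image rather than ignoring it.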
Discussion
Clinical and Theoretical Implications
JSN, representing structural narrowing, is confirmed as the most decisive predictive feature in KL grading, with osteophyte quantification providing supplementary structural information. The sclerosis module, at current annotation scale, mainly serves to close the feature audit loop. The architecture explicitly implements mechanistic transparency through independently measurable features, directly mapping the KL grade to auditable radiographic evidence.
Despite a performance gap between fully transparent (Path A) and deep hybrid (Path B) modeling, structured features materially contribute to image-based classification performance, as established through inference-time ablation. This suggests that interpretability and accuracy need not be mutually exclusive: a feature-decomposed, concept-driven design can approach state-of-the-art classification metrics while maintaining clinical auditability.
Limitations and Future Directions
Annotation scarcity, particularly for certain anatomical sites and sclerosis, limits generalizability and classifier robustness. Image pixel calibration is absent, restricting absolute measurement fidelity without external scale factors. Inter-annotator agreement metrics were not calculated, and external validation remains pending. Advancement requires expansion of annotation volumes, incorporation of calibrated imaging, and rigorous testing across diverse deployment scenarios, including the intended low-resource environments.
On a theoretical level, the modular design aligns with calls for a paradigm shift from post-hoc XAI toward mechanistically explicit medical AI systems. As clinical settings demand traceable reasoning and auditable decision-making, future developments should expand toward richer feature quantification and cross-modal integration.
Conclusion
Knee-xRAI operationalizes interpretable, auditable AI for KOA grading by structuring KL grade classification around directly measured radiographic features. The framework's dual-path architecture provides both competitive predictive performance and feature-level transparency, laying the groundwork for trustworthy AI deployment in clinical radiology and advancing the field toward mechanistically transparent medical imaging diagnostics.