Knee-xRAI: An Explainable AI Framework for Automatic Kellgren-Lawrence Grading of Knee Osteoarthritis

Published 25 Apr 2026 in cs.CV, cs.AI, and cs.LG | (2604.23435v1)

Abstract: Radiographic grading of knee osteoarthritis (KOA) with the Kellgren-Lawrence (KL) system is limited by inter-reader variability and the opacity of current deep learning approaches, which predict KL grades directly from images without decomposing structural features. We present Knee-xRAI, a modular framework that independently quantifies the three cardinal radiographic features of KOA (joint space narrowing [JSN], osteophytes, and subchondral sclerosis) and integrates them into an explainable KL grade classification. The pipeline combines U-Net++ segmentation for contour-based JSN measurement, an SE-ResNet-50 network for per-site osteophyte grading (OARSI scale), and a hybrid texture-CNN classifier for binary sclerosis quantification. The resulting 50-dimensional structured feature vector feeds two complementary classification paths. An XGBoost path supports SHAP-based feature attribution. A ConvNeXt hybrid path combines the structured vector with a full-image encoder for enhanced predictive performance. Evaluated on 8,260 radiographs from an OAI-derived dataset, the JSN module achieved a Dice coefficient of 0.8909 and an mJSW intraclass correlation of 0.8674 against manual annotations. The ConvNeXt hybrid path reached a test quadratic weighted kappa (QWK) of 0.8436 and AUC of 0.9017. The transparent XGBoost path achieved a test QWK of 0.6294 with full feature-level audit capability. Ablation confirmed JSN as the dominant predictor (QWK = 0.6103 alone), with osteophyte features providing consistent incremental gain (+0.0183) and sclerosis contributing marginally. Inference-time ablation of Path B confirmed the structured pathway contributes materially beyond the image encoder, with QWK drops of 0.098 (feature zeroing) and 0.284 (feature-image permutation). Knee-xRAI explicitly quantifies all three KL-defining radiographic features within a single auditable pipeline.


Authors (6)

Summary

The paper introduces Knee-xRAI, a framework that automates KL grading by explicitly quantifying JSN, osteophytes, and sclerosis.
It employs a dual-path system combining XGBoost for auditable features and a ConvNeXt hybrid model for enhanced classification performance.
The study demonstrates that integrating explicit feature extraction with deep learning can yield competitive metrics while enhancing clinical transparency.

Knee-xRAI: An Explainable AI Framework for Automatic Kellgren–Lawrence Grading of Knee Osteoarthritis

Introduction and Motivation

Knee osteoarthritis (KOA), a high-prevalence degenerative condition, is primarily evaluated radiographically using the Kellgren–Lawrence (KL) grading system—a five-grade ordinal measure defined by joint space narrowing (JSN), osteophyte formation, and subchondral sclerosis. The literature documents substantial inter-reader variability and ambiguity across KL grade definitions, with direct therapeutic and triage consequences. While previous machine learning approaches have demonstrated strong predictive performance on KL grade classification, they typically do so by compressing clinical features into opaque, end-to-end models with limited transparency for feature-specific audit or clinical review.

Knee-xRAI addresses this interpretability deficit by explicitly quantifying the cardinal radiographic findings, using modular deep learning and hybrid techniques, and integrating them into a dual-path classification system for KL grade prediction. The framework aims to deliver competitive classification metrics while providing direct, auditable evidence for each structural feature, facilitating deployment in radiologist-shortage environments and resource-constrained healthcare settings.

Methods and Framework Architecture

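Knee-xRAI is structured as a four-stage pipeline: JSN segmentation and quantification, osteophyte grading, sclerosis classification, and dual-path KL grade classification.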

Figure 1: A schematic overview of the Knee-xRAI pipeline, with explicit quantification of JSN, osteophytes, and sclerosis feeding feature-based and image-based KL grade classifiers.

JSN Segmentation and Quantification

JSN segmentation is performed with a U-Net++ architecture employing an EfficientNet-B4 encoder, trained on manually annotated contours to prioritize minimum joint space width (mJSW) measurement accuracy. The resulting JSN sub-vector contains compartmental mJSW, JSN rates, and asymmetry indices for both the medial and lateral joint spaces.
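As a rough illustration of contour-based measurement, the sketch below reads a minimum joint space width off two binary bone masks. The source does not describe the measurement procedure, so the column-wise vertical gap, the function names `min_joint_space_width` and `asymmetry_index`, and the normalised asymmetry formula are all assumptions made for illustration. Widths are in pixels, since the images are uncalibrated.

```python
def min_joint_space_width(femur_mask, tibia_mask, col_start, col_end):
    """Smallest vertical gap (in pixels) between femur and tibia over a column range.

    Masks are lists of rows holding 0/1 values, with row 0 at the top of the image.
    Returns None when no column in the range contains both bones.
    Hypothetical helper: the paper's actual measurement rule is not given.
    """
    n_rows = len(femur_mask)
    best = None
    for c in range(col_start, col_end):
        femur_rows = [r for r in range(n_rows) if femur_mask[r][c]]
        tibia_rows = [r for r in range(n_rows) if tibia_mask[r][c]]
        if not femur_rows or not tibia_rows:
            continue  # this column does not cross both bones
        # Lower edge of the femur versus upper edge of the tibia.
        gap = max(min(tibia_rows) - max(femur_rows) - 1, 0)
        if best is None or gap < best:
            best = gap
    return best


def asymmetry_index(medial, lateral):
    """Signed, normalised medial-lateral difference; 0 means symmetric (assumed form)."""
    total = medial + lateral
    return 0.0 if total == 0 else (medial - lateral) / total
```

Calling `min_joint_space_width` once per compartment (with the column range covering the medial or lateral joint space) gives the two compartmental values that `asymmetry_index` compares.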

Figure 2: JSN module outputs with feature overlays for KL grades 0, 2, and 4, demonstrating progression in joint space narrowing.

Osteophyte Grading

Osteophyte quantification utilizes an SE-ResNet-50 network, multitasked across four anatomical sites per the OARSI atlas. Site patch extraction is derived from JSN segmentation landmarks, supported by fallback anatomical cropping for robustness. The osteophyte feature vector encodes site-specific grades, total burden, and composite metrics.
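The following sketch shows one way per-site grades could be flattened into a feature vector. It assumes the four sites are the medial and lateral femur and tibia graded 0 to 3, as in the OARSI atlas; the source names only the two femoral sites. The composite metrics shown (maximum grade, medial and lateral burden) and the name `osteophyte_features` are illustrative guesses, not the authors' definitions.

```python
# Assumed site names; the source explicitly mentions only the femoral sites.
OARSI_SITES = ("medial_femur", "lateral_femur", "medial_tibia", "lateral_tibia")


def osteophyte_features(site_grades):
    """Turn per-site OARSI grades (0-3) into a flat feature dict."""
    for site in OARSI_SITES:
        grade = site_grades[site]
        if grade not in (0, 1, 2, 3):
            raise ValueError(f"{site}: OARSI grade must be 0-3, got {grade}")

    features = {f"grade_{site}": site_grades[site] for site in OARSI_SITES}
    # Total burden: sum of the four site grades.
    features["total_burden"] = sum(site_grades[s] for s in OARSI_SITES)
    # Hypothetical composite metrics.
    features["max_grade"] = max(site_grades[s] for s in OARSI_SITES)
    features["medial_burden"] = site_grades["medial_femur"] + site_grades["medial_tibia"]
    features["lateral_burden"] = site_grades["lateral_femur"] + site_grades["lateral_tibia"]
    return features
```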

Sclerosis Classification

Subchondral sclerosis is assessed via hybrid texture-CNN modeling, applying handcrafted statistical descriptors alongside EfficientNet-B0 for binary classification (presence vs. absence). The module adapts the ROI extraction strategy to the tibial plateau for anatomical fidelity.
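To make "handcrafted statistical descriptors" concrete, here is a minimal sketch that computes first-order intensity statistics on an ROI. The source does not list which descriptors were used, so the choice of mean, standard deviation, skewness, and histogram entropy, along with the function name and the 16-bin histogram, is an assumption.

```python
import math


def texture_descriptors(roi, n_bins=16, max_value=255):
    """First-order texture statistics for a grayscale ROI (list of pixel rows)."""
    pixels = [p for row in roi for p in row]
    n = len(pixels)
    mean = sum(pixels) / n
    std = math.sqrt(sum((p - mean) ** 2 for p in pixels) / n)
    # Skewness is undefined for a constant patch; report 0 in that case.
    skew = 0.0 if std == 0 else sum((p - mean) ** 3 for p in pixels) / (n * std ** 3)

    # Shannon entropy (bits) of the intensity histogram.
    counts = [0] * n_bins
    for p in pixels:
        idx = min(int(p * n_bins / (max_value + 1)), n_bins - 1)
        counts[idx] += 1
    entropy = -sum((c / n) * math.log2(c / n) for c in counts if c)

    return {"mean": mean, "std": std, "skewness": skew, "entropy": entropy}
```

In a hybrid design such as the one described, a vector like this would be concatenated with the CNN embedding before the final binary classifier.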

Figure 3: ROI extraction and sclerosis classification outcomes for representative patients, visualizing the subchondral region and classifier decision.

Dual KL Grade Classification Paths

Feature aggregation produces a 50-dimensional structured vector spanning all measured radiographic findings. Two classification paths are deployed:

Path A: An XGBoost classifier for fully auditable, feature-level attribution via SHAP.
Path B: A ConvNeXt-Small hybrid model fusing the structured vector with a global image encoding for optimized performance.
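The attribution idea behind Path A can be illustrated with exact Shapley values computed by brute-force enumeration. This is not the algorithm the SHAP library applies to tree ensembles, which uses far more efficient estimators; it is a small stand-in that shows what a per-feature attribution means. The toy model, the baseline vector, and the function name `shapley_values` are assumptions.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley attribution of predict(x) - predict(baseline) to each feature.

    Features outside a coalition are replaced by their baseline value.
    Cost grows as 2**n, so this is only practical for a handful of features.
    """
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = set(subset)
                # Marginal contribution of feature i to this coalition.
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi
```

The attributions sum to the difference between the prediction for `x` and the prediction for the baseline, which is the property that makes a feature-level audit of a single KL prediction possible.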

A Gradio-based interface provides interactive feature overlays, spatial artifacts, and class probability explanations, supporting clinical decision-making.

Experimental Results

The pipeline was evaluated on a large-scale, OAI-derived dataset (8,260 radiographs) partitioned into stratified train/validation/test splits. Feature-level annotations (400–500 per module) guided training and module evaluation.

JSN segmentation yielded Dice $= 0.8909$ and ICC $= 0.8674$ for mJSW relative to manual labels, indicating good agreement with manual annotation. Osteophyte grading showed heterogeneous performance across sites, with medial femur $\kappa = 0.5828$ and lateral femur substantially lower ($\kappa = 0.1048$), attributed to annotation volume and anatomical projection ambiguity. The sclerosis classifier, optimized for macro F1, achieved test macro F1 $= 0.5785$ and AUC $= 0.6114$, with performance constrained by annotation scale.
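For reference, the Dice coefficient reported for segmentation is $2|A \cap B| / (|A| + |B|)$ for predicted mask $A$ and manual mask $B$. The sketch below computes it for binary masks stored as nested lists; the function name and the convention of returning 1.0 when both masks are empty are assumptions.

```python
def dice_coefficient(pred_mask, true_mask):
    """Dice overlap between two binary masks given as lists of rows."""
    pred = [bool(v) for row in pred_mask for v in row]
    true = [bool(v) for row in true_mask for v in row]
    if len(pred) != len(true):
        raise ValueError("masks must have the same number of pixels")

    intersection = sum(p and t for p, t in zip(pred, true))
    total = sum(pred) + sum(true)
    # Two empty masks agree perfectly (assumed convention).
    return 1.0 if total == 0 else 2.0 * intersection / total
```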

KL grade classification results are as follows:

Path A (XGBoost): QWK $= 0.6294$, accuracy $= 0.5399$, macro F1 $= 0.5238$, AUC $= 0.8046$.
Path B (ConvNeXt Hybrid): QWK $= 0.8436$, accuracy [value missing], macro F1 [value missing], AUC $= 0.9017$.
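Quadratic weighted kappa penalises a disagreement by the squared distance between the two grades, so confusing KL 0 with KL 4 costs far more than confusing KL 2 with KL 3. A minimal reference implementation for integer grades is sketched below; the function name and the handling of the degenerate single-class case are assumptions.

```python
def quadratic_weighted_kappa(y_true, y_pred, n_classes=5):
    """QWK for integer grades in range(n_classes); 1.0 is perfect agreement."""
    n = len(y_true)
    observed = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1

    row_totals = [sum(row) for row in observed]
    col_totals = [sum(observed[i][j] for i in range(n_classes)) for j in range(n_classes)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            weight = (i - j) ** 2 / (n_classes - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Expected count under independence of the two raters.
            weighted_expected += weight * row_totals[i] * col_totals[j] / n

    if weighted_expected == 0:
        return 1.0  # every label falls in one class; no disagreement is possible
    return 1.0 - weighted_observed / weighted_expected
```

With five KL grades, a single off-by-one error among otherwise correct predictions still leaves the score close to 1, whereas fully reversed predictions give a negative score.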

Ablation studies (Path A) confirmed JSN features as the dominant predictor (QWK $= 0.6103$ alone), with osteophyte features yielding incremental gains ($\Delta$QWK $= +0.0183$) and sclerosis features contributing marginally under the current labeling scheme.

Figure 4: (Left) Feature-family ablation chart for QWK; (Right) global SHAP feature attribution across the test set, highlighting JSN dominance.

Structured path ablation (Path B) showed substantial QWK drops ($0.098$ under feature zeroing, $0.284$ under feature-image permutation), indicating that the ConvNeXt hybrid relies on the alignment between the structured features and the image rather than treating the structured vector as an exogenous input.
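The two inference-time ablations can be sketched generically: score the model as is, then with every structured vector set to zero, then with the structured vectors shuffled across samples so that each image is paired with another knee's features. The function below is a hypothetical harness; `predict`, `metric`, and the return keys are illustrative names, and the source does not specify how the permutation was drawn.

```python
import random


def structured_ablation(predict, images, structured, labels, metric, seed=0):
    """Measure how much a hybrid model's score drops when structured input is degraded.

    predict(image, vector) -> predicted label; metric(labels, preds) -> float.
    """
    def score(vectors):
        preds = [predict(img, vec) for img, vec in zip(images, vectors)]
        return metric(labels, preds)

    baseline = score(structured)

    # Zeroing: remove all structured information.
    zeroed = score([[0.0] * len(vec) for vec in structured])

    # Permutation: keep realistic vectors but break the image-feature pairing.
    shuffled = list(structured)
    random.Random(seed).shuffle(shuffled)
    permuted = score(shuffled)

    return {
        "baseline": baseline,
        "drop_zeroing": baseline - zeroed,
        "drop_permutation": baseline - permuted,
    }
```

A model that ignores the structured vector shows no drop under either ablation, while a larger drop under permutation than under zeroing, as reported here, suggests that mismatched features actively mislead the model rather than merely removing information.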

Discussion

Clinical and Theoretical Implications

JSN, representing structural narrowing, is confirmed as the most decisive predictive feature in KL grading, with osteophyte quantification providing supplementary structural information. The sclerosis module, at current annotation scale, mainly serves to close the feature audit loop. The architecture explicitly implements mechanistic transparency through independently measurable features, directly mapping the KL grade to auditable radiographic evidence.

Despite a performance gap between fully transparent (Path A) and deep hybrid (Path B) modeling, structured features materially contribute to image-based classification performance, as established through inference-time ablation. This indicates that interpretability and accuracy need not be mutually exclusive: a feature-decomposed, concept-driven design can come close to state-of-the-art classification metrics while maintaining clinical auditability.

Limitations and Future Directions

Annotation scarcity, particularly for certain anatomical sites and sclerosis, limits generalizability and classifier robustness. Image pixel calibration is absent, restricting absolute measurement fidelity without external scale factors. Inter-annotator agreement metrics were not calculated, and external validation remains pending. Advancement requires expansion of annotation volumes, incorporation of calibrated imaging, and rigorous testing across diverse deployment scenarios, including the intended low-resource environments.

On a theoretical level, the modular design aligns with calls for a paradigm shift from post-hoc XAI toward mechanistically explicit medical AI systems. As clinical settings demand traceable reasoning and auditable decision-making, future developments should expand toward richer feature quantification and cross-modal integration.

Conclusion

Knee-xRAI operationalizes interpretable, auditable AI for KOA grading by structuring KL grade classification around directly measured radiographic features. The framework's dual-path architecture provides both competitive predictive performance and feature-level transparency, laying the groundwork for trustworthy AI deployment in clinical radiology and advancing the field toward mechanistically transparent medical imaging diagnostics.
