Horn-Specific Meniscus Injury Grading
- The paper introduces a standardized MRI grading framework that precisely evaluates meniscal horn tears using dual-view imaging.
- A curated MeniMV dataset of 750 knee MRI studies enables dual-head classification with co-registered sagittal and coronal views.
- Baseline models demonstrate strong performance while highlighting challenges such as adjacent grade confusion and domain generalization.
Horn-specific meniscus injury grading is a standardized framework for evaluating meniscal horn tears based on multi-view magnetic resonance imaging (MRI), yielding fine-grained localization and severity assessments for both the anterior and posterior meniscal horns. Motivated by clinical requirements for precise grading, the horn-specific approach contrasts with earlier practices relying on binary or study-level labels. The MeniMV benchmark operationalizes this paradigm with a curated multi-view dataset and stratified annotation protocol, enabling rigorous evaluation of automated grading models and systematic investigation of error modes (Xu et al., 20 Dec 2025).
1. Dataset Construction and Annotation Protocol
The MeniMV dataset comprises 750 retrospectively curated, de-identified knee MRI studies sourced from three tertiary hospitals, encompassing 405 female and 345 male subjects (age range 14–82 years, mean 55.6 ± 12.7 years). Each study includes fat-suppressed T2-weighted sagittal and coronal series, sequences favored for detecting intrameniscal fluid due to high clinical sensitivity (∼94%).
Preprocessing involved DICOM anonymization (DicomAnonymizer), automated exclusion of motion-degraded slices, and selection of diagnostically informative slices. Six senior orthopedic physicians (≥10 years of experience) selected four slices per meniscus horn (anterior/posterior) from each plane, producing two slice pairs per exam (anterior and posterior), totaling four images (sagittal + coronal) per patient.
Annotation followed the Stoller 0–III scale, mapped to grades 0–3, and was performed independently by chief orthopedic physicians. Consensus validation ensured >95% label agreement prior to dataset finalization. The dataset comprises 1,500 labeled horn-specific cases (750 patients × 2 horns), each case represented by one sagittal–coronal slice-pair (totaling 3,000 pairs or 6,000 DICOM images). Grade prevalence is summarized in Table 1.
| Grade | Description | Case Count (%) |
|---|---|---|
| 0 | Normal | 1,331 (44.4%) |
| 1 | Mild | 502 (16.7%) |
| 2 | Moderate | 306 (10.2%) |
| 3 | Severe | 861 (28.7%) |
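The skew in Table 1 (Grades 1 and 2 under-represented relative to 0 and 3) is what motivates class-weighted losses later in the pipeline. A minimal sketch of inverse-frequency class weights derived from these counts; the normalization (weights summing to the number of classes) is a common convention assumed here, not a detail reported by the paper:

```python
# Inverse-frequency class weights from the grade prevalence in Table 1.
# Normalizing weights to sum to the number of classes is a common
# convention (an assumption), not a value specified by the paper.
counts = {0: 1331, 1: 502, 2: 306, 3: 861}
total = sum(counts.values())

raw = {g: total / n for g, n in counts.items()}       # inverse frequency
scale = len(counts) / sum(raw.values())
weights = {g: w * scale for g, w in raw.items()}      # sums to 4

# Rarest grade (2) receives the largest weight, most common (0) the smallest.
```

These weights can serve directly as the per-class focal-loss weights discussed in the training section.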
2. Horn-Specific Four-Tier Grading Scale
The grading protocol is adapted from the Stoller classification, applied independently to the anterior and posterior meniscal horns. The grades are defined as follows:
- Grade 0 (Normal): Uniform low signal intensity within meniscal tissue; absence of intrameniscal fluid.
- Grade 1 (Mild): Focal or linear increased signal limited to the meniscal substance, not reaching the articular surface.
- Grade 2 (Moderate): Linear or punctate high signal reaching but not crossing the articular surface.
- Grade 3 (Severe): Fluid-equivalent signal traversing one or both articular surfaces, indicative of complete tear.
Grading is performed identically in sagittal and coronal planes to match radiological best practices. The approach provides granular localization and severity rating, facilitating evaluation of both clinical and algorithmic workflows.
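The four-tier scale is ordinal, which downstream code (e.g. MAE computation, adjacent-grade analysis) relies on. A small illustrative encoding, assuming a simple `IntEnum` representation not prescribed by the paper:

```python
from enum import IntEnum

class HornGrade(IntEnum):
    """Horn-specific four-tier scale adapted from Stoller (illustrative)."""
    NORMAL = 0    # uniform low signal, no intrameniscal fluid
    MILD = 1      # intrasubstance signal, not reaching the articular surface
    MODERATE = 2  # signal reaching but not crossing the articular surface
    SEVERE = 3    # fluid-equivalent signal traversing >= 1 articular surface

def reaches_surface(grade: HornGrade) -> bool:
    """Grades 2-3 involve the articular surface; 0-1 do not."""
    return grade >= HornGrade.MODERATE
```

Because the grades are integers, ordinal metrics such as MAE and adjacent-grade confusion follow naturally from arithmetic on the labels.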
3. Multi-View Co-Registration and Feature Fusion
The dataset architecture leverages the dual-perspective diagnostic context essential in clinical assessment. Sagittal and coronal series are inherently co-registered via patient/scanner spatial tags, obviating the need for additional deformable registration. Slice indices correspond anatomically, ensuring consistent meniscus cross-sections between views.
The feature fusion pipeline passes each view ($x_{\text{sag}}$, $x_{\text{cor}}$) through a shared encoder backbone $f_\theta$ to obtain embeddings $z_{\text{sag}} = f_\theta(x_{\text{sag}})$ and $z_{\text{cor}} = f_\theta(x_{\text{cor}})$, which are channel-wise concatenated:

$$z = [\, z_{\text{sag}} \,\|\, z_{\text{cor}} \,]$$

A positional indicator $p$ (posterior: $p{=}0$; anterior: $p{=}1$) routes the fused representation $z$ to the appropriate classification head, permitting region-specific boundary learning.
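The wiring just described (shared encoder, channel-wise concatenation, position-routed heads) can be sketched in PyTorch. The toy convolutional encoder and embedding width below are placeholders standing in for the paper's actual backbones (e.g. ResNet-50 or Swin); only the dual-view, dual-head structure is illustrated:

```python
import torch
import torch.nn as nn

class DualViewGrader(nn.Module):
    """Sketch of the dual-view, dual-head design; the tiny CNN encoder
    is an assumed stand-in for the paper's backbones."""

    def __init__(self, embed_dim: int = 128, num_grades: int = 4):
        super().__init__()
        # Shared encoder applied to both views (weight sharing).
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, embed_dim),
        )
        # Separate fully connected heads for the two horns.
        self.head_posterior = nn.Linear(2 * embed_dim, num_grades)
        self.head_anterior = nn.Linear(2 * embed_dim, num_grades)

    def forward(self, x_sag, x_cor, horn):
        z_sag, z_cor = self.encoder(x_sag), self.encoder(x_cor)
        z = torch.cat([z_sag, z_cor], dim=1)      # channel-wise concat
        logits = torch.where(
            horn.unsqueeze(1).bool(),             # anterior = 1
            self.head_anterior(z),
            self.head_posterior(z),               # posterior = 0
        )
        return logits, z_sag, z_cor
```

Returning the per-view embeddings alongside the logits makes it straightforward to attach the cross-view consistency penalty described in the next section.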
4. Model Architectures and Training Methodologies
MeniMV supports benchmarking both CNN and Transformer-based architectures:
- CNN Backbones: ResNet-50, DenseNet-121, EfficientNet-B0, trained from scratch using 2D grayscale slices (224×224, zero mean/unit variance). Typical stack is conv→BN→ReLU, with global average pooling for 1,024–2,048-dimensional features.
- Transformer Backbones: ViT-B/16, Swin Transformer variants, Swin-UNETR encoder. Images are partitioned into 16×16 patches, linearly embedded, augmented with positional encodings, and processed by multi-head self-attention, yielding 768–1,024-dimensional pooled token representations.
Novel components include dual-head classification—separate fully connected layers for anterior and posterior predictions (softmax over grades 0–3, selected by the positional indicator $p$)—and a cross-view consistency regularizer encouraging the sagittal and coronal embeddings to agree ($z_{\text{sag}} \approx z_{\text{cor}}$) through an MSE penalty.
Training employs a hybrid loss:

$$\mathcal{L} = \mathcal{L}_{\text{focal}} + \lambda\, \mathcal{L}_{\text{cv}}$$

Focal loss controls class imbalance via per-class weights $\alpha_c$ and focusing parameter $\gamma$:

$$\mathcal{L}_{\text{focal}} = -\,\alpha_{y}\,(1 - p_{y})^{\gamma} \log p_{y}$$

where $p_{y}$ is the predicted probability of the true grade $y$. Cross-view MSE:

$$\mathcal{L}_{\text{cv}} = \left\| z_{\text{sag}} - z_{\text{cor}} \right\|_2^2$$
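A minimal NumPy sketch of this hybrid objective. The default values `gamma=2.0` and `lam=0.1` are common-practice assumptions, not hyperparameters reported by the paper:

```python
import numpy as np

def focal_loss(probs, target, alpha, gamma=2.0):
    """Per-sample multiclass focal loss; alpha holds per-class weights.
    gamma=2 is a common default, not a value taken from the paper."""
    p_y = probs[np.arange(len(target)), target]   # prob of the true grade
    return -alpha[target] * (1.0 - p_y) ** gamma * np.log(p_y)

def cross_view_mse(z_sag, z_cor):
    """Consistency penalty pulling the two view embeddings together."""
    return np.mean((z_sag - z_cor) ** 2)

def hybrid_loss(probs, target, z_sag, z_cor, alpha, lam=0.1, gamma=2.0):
    """L = focal + lam * cross-view MSE (lam is an assumed weight)."""
    return (focal_loss(probs, target, alpha, gamma).mean()
            + lam * cross_view_mse(z_sag, z_cor))
```

When the two embeddings already agree, the penalty vanishes and the objective reduces to the weighted focal term alone.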
Optimization uses AdamW with weight decay and cosine learning-rate scheduling (100 epochs, 10-epoch warmup), batch size 32, and balanced grade sampling. Data augmentations include random flips, small rotations (±15°), intensity jitter (±10%), and cropping.
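The cosine schedule with linear warmup can be written in a few lines; `base_lr` is a placeholder, since the paper's exact learning rate is not reproduced here:

```python
import math

def lr_at(epoch, base_lr, total_epochs=100, warmup_epochs=10):
    """Cosine learning-rate schedule with linear warmup, matching the
    100-epoch / 10-warmup setup described above. base_lr is assumed."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs   # linear warmup
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

The rate climbs linearly to `base_lr` over the first ten epochs, then decays along a half-cosine toward zero by epoch 100.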
5. Evaluation Metrics and Statistical Summary
Model performance is quantified using standard multiclass metrics:
- Accuracy: $\frac{1}{N}\sum_{i=1}^{N} \mathbb{1}[\hat{y}_i = y_i]$
- Sensitivity (Recall, per class $c$): $\frac{TP_c}{TP_c + FN_c}$
- Specificity (per class $c$): $\frac{TN_c}{TN_c + FP_c}$
- Precision (per class $c$): $\frac{TP_c}{TP_c + FP_c}$
- F1 score (per class $c$): $F1_c = \frac{2 \cdot \text{Precision}_c \cdot \text{Recall}_c}{\text{Precision}_c + \text{Recall}_c}$
- Macro-F1: Unweighted average of all $F1_c$
- Cohen’s kappa: $\kappa = \frac{p_o - p_e}{1 - p_e}$, with observed agreement $p_o$ and chance agreement $p_e$
- Mean Absolute Error (MAE): $\frac{1}{N}\sum_{i=1}^{N} |\hat{y}_i - y_i|$
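The headline metrics above can be computed from predicted and true grades with plain NumPy; this is a straightforward sketch, not the paper's evaluation code:

```python
import numpy as np

def grading_metrics(y_true, y_pred, num_classes=4):
    """Accuracy, macro-F1, Cohen's kappa, and MAE for ordinal grades."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    acc = np.mean(y_true == y_pred)
    f1s = []
    for c in range(num_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    macro_f1 = float(np.mean(f1s))
    # Cohen's kappa: observed agreement vs. chance agreement.
    p_o = acc
    p_e = sum(np.mean(y_true == c) * np.mean(y_pred == c)
              for c in range(num_classes))
    kappa = (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0
    # MAE exploits the ordinal integer grades directly.
    mae = float(np.mean(np.abs(y_true - y_pred)))
    return {"accuracy": acc, "macro_f1": macro_f1,
            "kappa": kappa, "mae": mae}
```

MAE is the metric most sensitive to the adjacent-grade confusion discussed below, since an off-by-one prediction costs 1 while an exact match costs 0.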
Baseline results obtained on the MeniMV dataset are summarized in Table 2:
| Model | Accuracy (%) | Macro-F1 | MAE |
|---|---|---|---|
| DenseNet-121 | 67.8 | 0.573 | 0.74 |
| MRNet (adapted) | 69.5 | 0.591 | 0.68 |
| 2.5D ResNet Fusion | 71.8 | 0.615 | 0.63 |
| DeepKnee | 74.2 | 0.638 | 0.58 |
| Swin-UNETR encoder | 76.9 | 0.679 | 0.51 |
6. Error Modes and Robustness Considerations
Analysis of baseline models reveals persistent challenges:
- Adjacent Grade Confusion: Misclassification between grades 1 and 2 comprises >60% of all errors, attributed to subtle signal changes at the articular margin.
- Grade Imbalance: Under-representation of Grade 1 yields relatively low sensitivity (~0.48) even with class-balanced sampling.
- Cross-Domain Generalization: Pretrained Transformers (Swin-UNETR) outperform in cross-center leave-one-center-out experiments (Macro-F1 = 0.641, Acc = 73.6%, MAE = 0.56), yet domain-specific prior models (DeepKnee) remain competitive (Acc = 72.5%, Macro-F1 = 0.625).
- Demographic Effects: A slight performance gap in younger (<40 years) and female cohorts (Macro-F1 gap ≤0.025) suggests bias mitigation as a future direction.
This suggests that further gains may require high-resolution modeling of marginal signal features, advanced data-balancing schemes, and incorporation of clinical/demographic priors.
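The leave-one-center-out protocol behind the cross-domain results above is simple to express; the function below is an illustrative sketch (the `studies` mapping and identifiers are hypothetical, not from the dataset):

```python
def loco_splits(studies):
    """Leave-one-center-out splits for cross-domain evaluation.
    `studies` maps study_id -> center name; yields, for each center,
    (held_out_center, train_ids, test_ids)."""
    centers = sorted(set(studies.values()))
    for held_out in centers:
        test = [s for s, c in studies.items() if c == held_out]
        train = [s for s, c in studies.items() if c != held_out]
        yield held_out, train, test
```

With three source hospitals, this yields three train/test splits, each testing generalization to an entirely unseen center.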
7. Outlook, Benchmark Utility, and Methodological Implications
MeniMV establishes a clinically and diagnostically faithful benchmark for horn-specific meniscus grading, uniquely enabling joint assessment in anatomically-coherent views and precise stratification of tear severity. The combination of dual-view fusion, dual-head classifiers, and hybrid focal–MSE loss provides strong algorithmic baselines. Persistent confusion in mild/moderate grades and imperfect cross-domain robustness highlight the need for novel fusion strategies, improved attention mechanisms, and metadata integration.
A plausible implication is that future automated meniscus grading systems will benefit from explicit modeling of domain shifts and demographic effects, as well as architectural advances in both representation and fusion. The horn-specific paradigm, grounded in MeniMV, offers a robust platform for both method development and deployment evaluation in musculoskeletal imaging research (Xu et al., 20 Dec 2025).