MeniMV: Horn-Specific Meniscal Tear Benchmark
- MeniMV is a multi-view, horn-specific dataset featuring paired sagittal and coronal MRI images annotated with a four-tier severity scale.
- The dataset comprises 3,000 images from 750 patients collected across three centers using standardized T2WI-FS protocols for consistency.
- Benchmark evaluations show that pretrained transformer models, notably Swin-UNETR, outperform traditional CNNs in precise meniscal tear grading.
MeniMV is a multi-view, horn-specific benchmark dataset introduced to advance automated severity grading of meniscal horn tears using MRI exams. It was constructed to address the limitations of previous datasets, which primarily rely on coarse or binary study-level labels and lack precise localization and severity gradation, thereby constraining algorithmic development for clinically relevant meniscus injury analysis. MeniMV uniquely provides paired sagittal and coronal MRI images for both anterior and posterior meniscal horns, annotated with a four-tier severity scale, establishing a new standard for musculoskeletal imaging research and benchmarking (Xu et al., 20 Dec 2025).
1. Dataset Design and Construction
MeniMV was retrospectively composed of 3,000 horn-specific MRI images collected from 750 patients across three medical centers, with uniform acquisition via fat-suppressed T2-weighted sequences (T2WI-FS; sensitivity ≃ 94%). Each patient contributed two anatomically paired sagittal–coronal slice pairs (anterior and posterior horns), totaling 1,500 slice-pairs (3,000 images). Imaging parameters (3 mm slice thickness, 256×256 matrix) were standardized across institutions.
Co-registration between sagittal and coronal planes employed a rigid transformation, $\mathbf{x}' = R\,\mathbf{x} + \mathbf{t}$, where $R$ is the rotation matrix for anatomical alignment and $\mathbf{t}$ the translation vector.
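A rigid transformation maps each point to a rotated and translated copy of itself, preserving distances. The following is a minimal NumPy sketch of that mapping, not the authors' registration pipeline; the helper names (`rotation_z`, `rigid_transform`, `is_rotation`), the choice of a rotation about the z-axis, and the sample points are all illustrative assumptions.

```python
import numpy as np

def rotation_z(theta_rad):
    """Rotation matrix about the z-axis (illustrative choice of axis)."""
    c, s = np.cos(theta_rad), np.sin(theta_rad)
    return np.array([[c, -s, 0.0],
                     [s,  c, 0.0],
                     [0.0, 0.0, 1.0]])

def is_rotation(R, tol=1e-8):
    """A proper rotation is orthogonal with determinant +1."""
    orthogonal = np.allclose(R.T @ R, np.eye(3), atol=tol)
    return bool(orthogonal and np.isclose(np.linalg.det(R), 1.0, atol=tol))

def rigid_transform(points, R, t):
    """Apply x' = R x + t to an (N, 3) array of points."""
    points = np.asarray(points, dtype=float)
    return points @ R.T + t

# Hypothetical example: rotate 90 degrees about z, then shift by 1 along x.
R = rotation_z(np.pi / 2)
t = np.array([1.0, 0.0, 0.0])
pts = np.array([[1.0, 0.0, 0.0],
                [0.0, 1.0, 0.0]])
moved = rigid_transform(pts, R, t)
print(moved)
```

Because the transform is rigid, the distance between any two points is unchanged after mapping, which is a convenient sanity check when validating an estimated alignment.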
Annotation involved six orthopedic clinicians (>10 years' experience) independently reviewing each exam; each horn's most diagnostic sagittal–coronal pair was selected and graded on the Stoller 0–3 scale (0 = normal, 3 = severe). Annotations were double-validated by chief orthopedic physicians to ensure consensus. This design enables precise, horn-level, multi-view injury characterization absent in prior resources (Xu et al., 20 Dec 2025).
2. Population Demographics and Injury Grade Distribution
MeniMV's patient cohort had broad demographic coverage (405 female, 345 male; age 14–82, mean 55.6 ± 12.7 years). The overall grade distribution across both meniscal horns was:
| Grade | Count | Percentage (%) |
|---|---|---|
| 0 | 1331 | 44.4 |
| 1 | 502 | 16.7 |
| 2 | 306 | 10.2 |
| 3 | 861 | 28.7 |
Analysis revealed that grade 0 predominated in younger cohorts, while the prevalence of grade 3 rose sharply after age 50, peaking in the 60–70 group. Mean ages by sex were 56.2 (±12.2) for males and 55.1 (±13.1) for females, indicating representative age and sex distributions (Xu et al., 20 Dec 2025).
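The reported counts can be checked against the dataset size and the stated percentages with a few lines of arithmetic. The sketch below uses only the counts from the table; the imbalance ratio it prints is a derived quantity, not a figure reported in the source.

```python
# Grade counts as listed in the distribution table.
counts = {0: 1331, 1: 502, 2: 306, 3: 861}

total = sum(counts.values())

# Percentages rounded to one decimal place, as in the table.
percent = {grade: round(100 * n / total, 1) for grade, n in counts.items()}

# Ratio between the most and least frequent grades (derived, not reported).
imbalance_ratio = max(counts.values()) / min(counts.values())

print(total)
print(percent)
print(round(imbalance_ratio, 2))
```

The counts sum to the 3,000 images of the dataset, and the most common grade (0) is a little over four times as frequent as the rarest (2), which motivates the imbalance-aware training objective used in the benchmark.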
3. Benchmark Evaluation: Architectures and Training Protocol
Severity grading was posed as a multi-view classification problem: $f: (x_{\text{sag}}, x_{\text{cor}}) \mapsto y \in \{0, 1, 2, 3\}$. Three architectural categories were benchmarked:
- Generic CNNs (trained from scratch): ResNet-50, ResNeXt-50 (32×4d), EfficientNet-B0, ShuffleNet-v2, MobileNet-v2, ConvNeXt-T, DenseNet-121, ViT-B/16.
- Domain-specific architectures: MRNet, 2.5D ResNet Fusion, DeepKnee (adapted).
- Modern pretrained backbones: ConvNeXt-V2-B, Swin-T, Swin-B, ViT-B/16 (MAE/DINO pretrained), Swin-UNETR encoder.
Training employed a hybrid objective, $\mathcal{L} = \mathcal{L}_{\text{focal}} + \mathcal{L}_{\text{MSE}}$, where the focal loss ($\mathcal{L}_{\text{focal}}$) handled label imbalance and the MSE term ($\mathcal{L}_{\text{MSE}}$) encouraged consistency between sagittal and coronal embeddings. Baseline cross-entropy ($\mathcal{L}_{\text{CE}}$) was used for comparison. Optimization used Adam (learning rate 1e-4, weight decay 1e-5) with batches of 32 slice pairs and early stopping on validation Macro-F1. Data augmentations included random rotations (±15°), horizontal flips, and intensity jittering (Xu et al., 20 Dec 2025).
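A hybrid objective of this kind can be sketched in NumPy as a focal classification term plus a mean-squared-error penalty between the two view embeddings. This is an illustration under stated assumptions rather than the authors' implementation: the focusing parameter `gamma`, the term weight `lam`, and all function names are hypothetical, and the source does not report their values.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with max-subtraction for numerical stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(logits, labels):
    """Baseline cross-entropy, averaged over the batch."""
    p_true = softmax(logits)[np.arange(len(labels)), labels]
    return float(np.mean(-np.log(p_true + 1e-12)))

def focal_loss(logits, labels, gamma=2.0):
    """Focal loss: down-weights well-classified samples by (1 - p)^gamma.
    gamma is an assumed value, not taken from the source."""
    p_true = softmax(logits)[np.arange(len(labels)), labels]
    return float(np.mean(-((1.0 - p_true) ** gamma) * np.log(p_true + 1e-12)))

def view_consistency_mse(z_sag, z_cor):
    """MSE between sagittal and coronal embeddings of the same horn."""
    return float(np.mean((z_sag - z_cor) ** 2))

def hybrid_loss(logits, labels, z_sag, z_cor, gamma=2.0, lam=1.0):
    """Focal term plus lam times the consistency term (lam is assumed)."""
    return focal_loss(logits, labels, gamma) + lam * view_consistency_mse(z_sag, z_cor)

# Hypothetical batch of two slice pairs, four grades, 3-d embeddings.
logits = np.array([[2.0, 0.5, 0.1, -1.0],
                   [0.2, 0.1, 1.5, 0.3]])
labels = np.array([0, 2])
z_sag = np.array([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])
z_cor = np.array([[0.1, 0.1, 0.3], [0.5, 0.5, 0.6]])
print(hybrid_loss(logits, labels, z_sag, z_cor))
```

With `gamma = 0` the focal term reduces to ordinary cross-entropy, which makes the relationship to the baseline objective explicit.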
4. Experimental Results and Quantitative Benchmarks
Evaluation metrics comprised Accuracy (Acc), Macro-F1, and Mean Absolute Error (MAE), computed per meniscal horn.
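For reference, the three metrics can be computed from integer grade labels as in the sketch below. It uses the standard definitions (unweighted mean of per-class F1 for Macro-F1, mean absolute grade difference for MAE); the function names and the toy labels are illustrative and not drawn from the benchmark.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of exactly correct grade predictions."""
    return float(np.mean(y_true == y_pred))

def macro_f1(y_true, y_pred, num_classes=4):
    """Unweighted mean of per-class F1; a class with no support and no
    predictions contributes 0 here (a convention, stated as an assumption)."""
    f1_scores = []
    for c in range(num_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom > 0 else 0.0)
    return float(np.mean(f1_scores))

def mean_absolute_error(y_true, y_pred):
    """Average distance in grade units, so a 0-vs-3 error costs more than 1-vs-2."""
    return float(np.mean(np.abs(y_true - y_pred)))

# Toy example with six horns (hypothetical labels).
y_true = np.array([0, 1, 2, 3, 0, 3])
y_pred = np.array([0, 2, 2, 3, 0, 1])
print(accuracy(y_true, y_pred), macro_f1(y_true, y_pred), mean_absolute_error(y_true, y_pred))
```

Accuracy and Macro-F1 treat all errors alike, whereas MAE reflects the ordinal structure of the grading scale, which is why the three are reported together.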
Backbone Performance
| Backbone | Acc (%) | Macro-F1 | MAE |
|---|---|---|---|
| ResNet-50 | 57.67 | 0.4667 | 0.91 |
| DenseNet-121 | 67.83 | 0.5730 | 0.74 |
| ViT-B/16 (scratch) | 50.33 | 0.2907 | 1.10 |
| MRNet (adapted) | 69.45 | 0.5912 | 0.68 |
| 2.5D ResNet Fusion | 71.80 | 0.6150 | 0.63 |
| DeepKnee (adapted) | 74.22 | 0.6385 | 0.58 |
| Swin-B (pretrained) | 76.38 | 0.6730 | 0.52 |
| Swin-UNETR (pt) | 76.92 | 0.6790 | 0.51 |
Pretrained Transformer architectures (notably Swin-UNETR) consistently achieved the highest performance (76.9% accuracy, 0.679 Macro-F1, MAE 0.51), outperforming both generic and domain-specific CNN designs. Most misclassifications occurred between adjacent grades (especially grades 1 and 2) and at posterior horns, indicating persistent challenges in detecting subtle or early-stage tears (Xu et al., 20 Dec 2025).
Multi-View Fusion Strategies
Experimentation with DenseNet-121 as a shared encoder evaluated three fusion strategies:
| Fusion Strategy | Accuracy (%) | Macro-F1 |
|---|---|---|
| Additive | 70.02 | 0.6021 |
| Attention-based | 72.46 | 0.6228 |
| Concatenation | 73.63 | 0.6347 |
Concatenation-based fusion delivered the best results of the three strategies, indicating the benefit of combining complementary anatomical information across imaging planes.
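The three fusion strategies differ in how the sagittal and coronal embeddings from the shared encoder are merged before classification. The NumPy sketch below illustrates one plausible form of each; the source does not specify the attention mechanism, so the single scoring vector `w` used here, like the function names, is an assumption.

```python
import numpy as np

def fuse_additive(z_sag, z_cor):
    """Element-wise sum: output keeps the embedding dimension D."""
    return z_sag + z_cor

def fuse_concat(z_sag, z_cor):
    """Concatenation: output has dimension 2D and keeps both views separate."""
    return np.concatenate([z_sag, z_cor], axis=-1)

def fuse_attention(z_sag, z_cor, w):
    """Weighted sum with per-sample view weights from a softmax over
    scores z @ w (an assumed, minimal attention form)."""
    scores = np.stack([z_sag @ w, z_cor @ w], axis=-1)      # shape (B, 2)
    scores = scores - scores.max(axis=-1, keepdims=True)    # stabilise exp
    alpha = np.exp(scores)
    alpha = alpha / alpha.sum(axis=-1, keepdims=True)
    return alpha[..., :1] * z_sag + alpha[..., 1:] * z_cor

# Hypothetical batch of two slice pairs with 2-d embeddings.
z_sag = np.array([[1.0, 2.0], [3.0, 4.0]])
z_cor = np.array([[0.0, 1.0], [1.0, 0.0]])
w = np.array([0.5, -0.5])

print(fuse_additive(z_sag, z_cor).shape)
print(fuse_concat(z_sag, z_cor).shape)
print(fuse_attention(z_sag, z_cor, w).shape)
```

Additive and attention-based fusion both collapse the two views into a single D-dimensional vector, while concatenation passes each view's features to the classifier unchanged, leaving it free to learn how they interact.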
Robustness Analyses
Leave-one-center-out (LOCO) cross-center validation revealed that pretrained Transformer models generalize more robustly to scanner and protocol variations:
| Model | Macro-F1 (avg) | Acc (avg %) | MAE (avg) |
|---|---|---|---|
| DeepKnee | 0.625 | 72.5 | 0.60 |
| Swin-UNETR | 0.641 | 73.6 | 0.56 |
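The leave-one-center-out protocol holds out all samples from one center for testing, trains on the remaining centers, and averages the per-fold metrics. A minimal sketch of the split logic follows; the center labels `"A"`, `"B"`, `"C"` and the helper names are hypothetical, since the source does not name the centers.

```python
def leave_one_center_out(center_ids):
    """Yield (held_out_center, train_indices, test_indices) for each center."""
    for held_out in sorted(set(center_ids)):
        train_idx = [i for i, c in enumerate(center_ids) if c != held_out]
        test_idx = [i for i, c in enumerate(center_ids) if c == held_out]
        yield held_out, train_idx, test_idx

def average_metric(per_fold_scores):
    """Unweighted mean over folds, as in an 'avg' column."""
    return sum(per_fold_scores) / len(per_fold_scores)

# Hypothetical center assignment for six slice pairs.
center_ids = ["A", "A", "B", "C", "B", "C"]
folds = list(leave_one_center_out(center_ids))
for held_out, train_idx, test_idx in folds:
    print(held_out, train_idx, test_idx)
```

Splitting by center rather than by image ensures that no scanner or site contributes to both training and test data in a fold, so the averaged score reflects cross-site generalization.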
Demographic stratification using Swin-UNETR found consistent performance across genders and age groups (e.g., Macro-F1: male 0.648, female 0.666; MAE 0.53 and 0.49, respectively) (Xu et al., 20 Dec 2025).
5. Key Findings and Identified Challenges
The MeniMV benchmark yielded several notable findings:
- Pretrained Transformers, particularly Swin-UNETR, outperformed convolutional models for multi-view meniscal grading.
- Concatenation-based fusion yielded the best performance among fusion strategies, suggesting that preserving independent view features is beneficial.
- The largest error rates were observed at grade boundaries (notably grades 1–2) and for posterior horn tear classification.
- Cross-center validation confirmed that architectures with large-scale pretraining and self-attention are less sensitive to differences in scanner or acquisition protocol.
- Mild lesions (Stoller grades 1–2) and small radial tears remained most susceptible to misclassification.
A plausible implication is that leveraging intra-study anatomical diversity and advanced feature fusion is crucial for high-fidelity automated grading (Xu et al., 20 Dec 2025).
6. Future Research Directions
Recommendations articulated in the benchmarking study for advancing the state of automated meniscal injury grading include:
- Employing semi-supervised or self-supervised pretraining on full 3D MRI volumes to capture more contextual and unlabeled data.
- Enhancing 2D–3D registration through deformable transformations to leverage volumetric anatomical information.
- Integrating explicit lesion localization (region-of-interest detection, attention mapping) and multi-scale feature pyramids to improve detection of subtle or early-grade injuries.
- Using ordinal-aware loss functions (e.g., label-distribution smoothing, ordinal regression) to respect the intrinsic order in meniscal severity grading.
- Augmenting imaging analysis with clinical metadata such as patient history or biomechanical scores for comprehensive severity modeling.
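As an illustration of the ordinal-aware direction listed above, one common formulation replaces the four-way label with three cumulative binary targets ("grade > 0", "grade > 1", "grade > 2"). The sketch below shows that encoding with a binary cross-entropy loss; it is one possible approach, not a method evaluated in the source, and all names are hypothetical.

```python
import numpy as np

def ordinal_encode(grades, num_grades=4):
    """Grade k becomes k ones followed by zeros, e.g. 2 -> [1, 1, 0]."""
    grades = np.asarray(grades)
    thresholds = np.arange(num_grades - 1)
    return (grades[:, None] > thresholds[None, :]).astype(float)

def ordinal_decode(probs, cutoff=0.5):
    """Predicted grade is the number of thresholds judged exceeded."""
    return (np.asarray(probs) > cutoff).sum(axis=1)

def ordinal_bce(probs, targets, eps=1e-12):
    """Mean binary cross-entropy over all threshold outputs."""
    probs = np.clip(probs, eps, 1.0 - eps)
    return float(-np.mean(targets * np.log(probs)
                          + (1.0 - targets) * np.log(1.0 - probs)))

targets = ordinal_encode([0, 1, 2, 3])
print(targets)

# For a true grade 0, a prediction near grade 1 costs less than one near grade 3.
true_zero = ordinal_encode([0])
near_miss = np.array([[0.9, 0.1, 0.1]])
far_miss = np.array([[0.9, 0.9, 0.9]])
print(ordinal_bce(near_miss, true_zero), ordinal_bce(far_miss, true_zero))
```

Under this encoding the penalty grows with the number of thresholds crossed in error, so confusing grade 0 with grade 3 is costlier than confusing adjacent grades, unlike plain cross-entropy.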
By providing 3,000 horn-specific, multi-view, expert-annotated MRI images and a suite of rigorous baselines, MeniMV establishes a foundational resource for model development, robustness assessment, and the investigation of automated, clinically relevant musculoskeletal imaging (Xu et al., 20 Dec 2025).