Model Performance Delta (MPD)
- Model Performance Delta (MPD) is an interpretable interval-scale metric that maps performance differences between models to probabilities using log-odds.
- It converts raw metrics such as AUC and RMSE into probability-based insights that remain stable across cross-validation folds and compose across comparisons.
- MPD extends to hardware scaling by predicting performance changes from resource augmentation, aiding in efficient architectural design.
Model Performance Delta (MPD) constitutes a rigorously interpretable interval-scale metric for quantifying performance differences between models, with application both in statistical comparison of predictive power and in analytic resource scaling for high-performance hardware systems. MPD subsumes traditional pairwise measures (such as ΔAUC or ΔRMSE) by mapping differences to a probability scale with formal log-odds semantics, and extends to closed-form prediction of performance changes due to architectural resource augmentation.
1. Formal Models for Quantifying Performance Delta
MPD is instantiated differently based on the domain: statistical model comparison or hardware performance scaling.
Predictive Model Setting (EPP framework):
For models evaluated over randomized rounds $k = 1, \dots, K$ (e.g., k-fold splits), a scalar metric (AUC, RMSE, etc.) induces win indicators $w_{ijk} \in \{0, 1\}$ recording whether model $i$ beats model $j$ in round $k$. The empirical probability $\hat{p}_{ij} = \frac{1}{K}\sum_k w_{ijk}$ represents the chance that $i$ outperforms $j$ in a random round. The EPP score $\beta_i$ for model $i$ is determined via logistic regression over all duels: $\operatorname{logit} P(i \text{ beats } j) = \beta_i - \beta_j$.
The MPD between models $i$ and $j$ is then defined as:
- Log-odds MPD: $\mathrm{MPD}(i, j) = \beta_i - \beta_j$
- Probability-based MPD: $\mathrm{MPD}_P(i, j) = \sigma(\beta_i - \beta_j)$, where $\sigma(x) = \frac{1}{1 + e^{-x}}$
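A minimal sketch of this fitting step in plain NumPy, assuming a simple duel table; the names `fit_epp` and `mpd` are illustrative, not the authors' reference implementation:

```python
import numpy as np

def fit_epp(wins, lr=0.01, iters=2000):
    """Fit EPP scores beta from a duel table.

    wins[i, j] = number of rounds in which model i beat model j.
    Model: P(i beats j) = sigmoid(beta_i - beta_j), fitted by plain
    gradient ascent on the log-likelihood (logistic regression over duels).
    """
    beta = np.zeros(wins.shape[0])
    rounds = wins + wins.T                     # rounds played per pair
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(beta[None, :] - beta[:, None]))  # p[i, j] = sigma(beta_i - beta_j)
        beta += lr * (wins - rounds * p).sum(axis=1)             # d log-lik / d beta_i
        beta -= beta.mean()                    # identifiability: centre the scores
    return beta

def mpd(beta, i, j):
    """Log-odds MPD and its probability-scale counterpart."""
    delta = beta[i] - beta[j]
    return delta, 1.0 / (1.0 + np.exp(-delta))

# Toy duel table for three models over 20 rounds per pair:
wins = np.array([[0, 14, 17],
                 [6,  0, 12],
                 [3,  8,  0]])
beta = fit_epp(wins)
print(mpd(beta, 0, 1))  # positive log-odds: model 0 tends to beat model 1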
Hardware Resource Scaling (DeLTA framework):
For CNN convolutions on GPUs, layer execution involves interleaved memory and compute streams. Performance is bounded by the maximum duration among the compute ($t_{\mathrm{comp}}$), shared-memory ($t_{\mathrm{shmem}}$), load-latency ($t_{\mathrm{lat}}$), and DRAM-bandwidth ($t_{\mathrm{bw}}$) streams. Altering resource $r$ by a factor $\alpha$ rescales only the relevant execution times:
$t_i^{\mathrm{new}} = t_i^{\mathrm{base}} \cdot \begin{cases} 1/\alpha & \text{if stream } i \text{ depends on resource } r \\ 1 & \text{otherwise} \end{cases}$
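As a sketch of this rescaling rule (the stream names and dict-based interface are assumptions for illustration, not DeLTA's actual code):

```python
def predicted_layer_time(base_times, scale_factors):
    """DeLTA-style bound recomputation.

    base_times:    dict mapping stream name -> baseline duration, e.g.
                   {"compute": ..., "shmem": ..., "load_lat": ..., "dram_bw": ...}
    scale_factors: dict mapping stream name -> resource scaling factor alpha;
                   streams absent from the dict keep their baseline time.
    """
    new_times = {s: t / scale_factors.get(s, 1.0) for s, t in base_times.items()}
    return max(new_times.values())  # layer time is the maximal stream bound

# Doubling SM count halves only the compute stream's time:
base = {"compute": 1.0, "shmem": 0.6, "load_lat": 0.4, "dram_bw": 0.9}
print(predicted_layer_time(base, {"compute": 2.0}))  # now dram_bw-bound at 0.9
```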
2. Interval-Scale Interpretation and Probabilistic Semantics
MPD as formalized via EPP scores operates on an interval scale wherein differences have an invariant probabilistic interpretation across datasets and metrics, overcoming weaknesses of raw score differentials (such as non-comparability and lack of direct meaning). Specifically, a fixed MPD yields the same outperformance probability, $P = \sigma(\mathrm{MPD})$, regardless of dataset specifics:
- Stability is encoded: frequent but narrow wins yield a higher EPP than occasional large wins, even when the average metric values are equal.
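As a worked instance of this invariance:

$$\mathrm{MPD} = 0.8 \;\Rightarrow\; P(i \text{ beats } j) = \frac{e^{0.8}}{1 + e^{0.8}} \approx 0.69$$

The same probability holds whether the underlying metric was AUC on one dataset or RMSE on another; the same figure reappears in the k-NN case study in Section 5.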
Complementarily, hardware-side MPD via DeLTA provides a closed-form analytic recipe for predicting performance change when scaling architectural resources, with close agreement to measured performance in the reported empirical cases. For example, doubling SM count or DRAM bandwidth applies multiplicative factors only to the streams bounded by those resources.
3. Addressing Classical Metric Shortcomings
Traditional metrics (AUC, F₁, ACC, RMSE) are ordinal- or ratio-scale quantities: their score differences lack universal meaning, and stability across folds is ignored. EPP-derived MPD resolves these issues:
- Direct probability interpretation: MPD corresponds precisely to the log-odds (logit) of one model outperforming the other.
- Interval scale: Differences have consistent probabilistic semantics independent of metric or dataset.
- Cross-validation stability: Each fold constitutes an independent duel; stability across folds is encoded in EPP.
- Cross-dataset comparability: No dataset-specific offsets or stretching; β parameters are dimensionless.
For hardware systems, DeLTA’s MPD similarly abstracts away metric-specific scale, allowing resource-based compositionality and prediction independent of layer details, provided the bounds are well-characterized.
4. Aggregation, Composition, and Additivity Properties
EPP-based MPD generalizes traditional pairwise comparisons, mapping every score differential into log-odds or explicit probability. It presents additive composition:
- For models $A$, $B$, $C$: if $A$ beats $B$ with odds $e^{\beta_A - \beta_B}$ and $B$ beats $C$ with odds $e^{\beta_B - \beta_C}$, then the odds that $A$ beats $C$ are $e^{\beta_A - \beta_B} \cdot e^{\beta_B - \beta_C} = e^{\beta_A - \beta_C}$.
- This contrasts with metric-based pairwise differences, which lack such compositionality.
On the hardware side, when scaling multiple resources simultaneously, one multiplies relevant rescaling factors for each stream, and the resulting MPD is derived by recomputing the maximal execution bound.
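Continuing the hypothetical `predicted_layer_time` sketch above, simultaneous scaling multiplies the per-stream factors before the maximum is retaken, which can shift the binding bound:

```python
# Scaling 2x SMs and 2x DRAM bandwidth at once (illustrative numbers):
base = {"compute": 1.0, "shmem": 0.6, "load_lat": 0.4, "dram_bw": 0.9}
both = {"compute": 2.0, "dram_bw": 2.0}
speedup = predicted_layer_time(base, {}) / predicted_layer_time(base, both)
print(speedup)  # ~1.67x: the layer becomes shmem-bound (0.6), not 2x faster
```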
5. Empirical Applications and Case Studies
Predictive Power (EPP):
In the main experiment, four algorithms (GBM, GLMnet, k-NN, RandomForest) × 11 hyperparameter settings × 11 OpenML datasets, each with 20 random splits, generated 9,680 pairwise comparisons. The logistic fit yielded 484 EPP scores (one per model-dataset pair: 4 × 11 × 11). Key findings include:
- Hyperparameter-specific β (e.g., k-NN β values spanning roughly ±0.4 give MPD = 0.8, i.e., an outperformance probability of ≈0.69; see the check below)
- Algorithm ordering (Random Forest always has β > 0, beating the "average" β = 0 model with probability σ(β) > 0.5)
- Dataset-specific reversals in β ordering (illustrated in dataset #334)
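The first bullet's arithmetic can be checked against the hypothetical `mpd` helper sketched in Section 1:

```python
import numpy as np
beta = np.array([0.4, -0.4])   # the quoted k-NN beta spread
print(mpd(beta, 0, 1))         # (0.8, ~0.69): MPD = 0.8 -> P ≈ 0.69
```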
GPU Performance (DeLTA):
For ResNet-152 convolutions, scaling from a Titan Xp to a V100 (SMs: 30 → 84, BW_MAC: 12.1 → 14.8 TFLOP/s, BW_DRAM: 450 → 850 GB/s):
- Compute-bound layers: the predicted speedup follows the MAC-throughput ratio, 14.8/12.1 ≈ 1.22×.
- DRAM-bound layers: the speedup can approach the bandwidth ratio, 850/450 ≈ 1.89× (nearly a factor of 2 from the near-doubling of BW_DRAM), but aggregate speedup reflects the mixture of bounds, as the sketch below illustrates.
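A back-of-the-envelope check of these figures with the `predicted_layer_time` sketch; the per-stream layer timings below are invented so that each bound is binding:

```python
# Titan Xp -> V100 per-stream factors from the quoted ratios:
factors = {"compute": 14.8 / 12.1, "dram_bw": 850 / 450}
compute_bound = {"compute": 1.0, "shmem": 0.5, "load_lat": 0.3, "dram_bw": 0.6}
dram_bound    = {"compute": 0.6, "shmem": 0.5, "load_lat": 0.3, "dram_bw": 1.0}
for layer in (compute_bound, dram_bound):
    print(predicted_layer_time(layer, {}) / predicted_layer_time(layer, factors))
# -> ~1.22x for the compute-bound layer, ~1.89x for the DRAM-bound one
```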
6. Direct Comparability Across Datasets and Architectures
Identifiability is achieved by centering all EPP scores (e.g., imposing $\sum_i \beta_i = 0$, or fixing a reference model at $\beta = 0$). This guarantees that MPD computed on different datasets is universally interpretable, independent of their native metric scales. In hardware settings, MPD admits compositional scaling: the derived speedup applies across heterogeneous layers and workloads, validated by empirical agreement.
7. Summary and Broader Implications
Model Performance Delta represents a principled basis for reporting, comparing, and composing performance differences. In predictive modeling, EPP-derived MPD is superior to raw metric differences, offering direct probabilistic interpretation, interval scaling, encoding of stability, and universality across datasets. In hardware performance modeling, analytic frameworks like DeLTA translate resource scaling into actionable predictions for throughput improvement with validated accuracy. This suggests MPD, grounded in log-odds and probability, is an essential metric for both statistical model selection and architectural design optimization (Gosiewska et al., 2019, Lym et al., 2019).