
Model Performance Delta (MPD)

Updated 2 December 2025
  • Model Performance Delta (MPD) is an interpretable interval-scale metric that maps performance differences between models to probabilities using log-odds.
  • It supports stable comparisons across cross-validation folds and compositional comparisons across models by converting raw metrics like AUC and RMSE into probability-based insights.
  • MPD extends to hardware scaling by predicting performance changes from resource augmentation, aiding in efficient architectural design.

Model Performance Delta (MPD) constitutes a rigorously interpretable interval-scale metric for quantifying performance differences between models, with application both in statistical comparison of predictive power and in analytic resource scaling for high-performance hardware systems. MPD subsumes traditional pairwise measures (such as ΔAUC or ΔRMSE) by mapping differences to a probability scale with formal log-odds semantics, and extends to closed-form prediction of performance changes due to architectural resource augmentation.

1. Formal Models for Quantifying Performance Delta

MPD is instantiated differently based on the domain: statistical model comparison or hardware performance scaling.

Predictive Model Setting (EPP framework):

For models $M = \{M_1, \ldots, M_n\}$ evaluated over randomized rounds $r \sim D$ (e.g., k-fold splits), a scalar score $s_r(M_i)$ (AUC, RMSE, etc.) induces win indicators $w_r(i, j) = \mathbf{1}\{s_r(M_i) > s_r(M_j)\}$. The empirical probability $p_{i,j}$ represents the chance that $M_i$ outperforms $M_j$ in a random round. The EPP score $\beta_i$ for model $M_i$ is determined via logistic regression over all duels:

$$\operatorname{logit}(p_{i,j}) = \log\frac{p_{i,j}}{1-p_{i,j}} = \beta_i - \beta_j$$

The MPD between models $A$ and $B$ is then defined as:

  • Log-odds MPD: $\Delta \operatorname{EPP}(A,B) = \beta_A - \beta_B$
  • Probability-based MPD: $\mathrm{MPD}_p(A,B) = \sigma(\beta_A - \beta_B) - 0.5$, where $\sigma(x) = 1/(1+e^{-x})$
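
This fitting procedure can be sketched as a Bradley-Terry-style logistic regression over win indicators. The example below uses hypothetical per-round AUC values and scikit-learn; it illustrates the duel construction and the resulting MPD, not the authors' reference implementation.

```python
# Minimal Bradley-Terry-style sketch of EPP fitting from per-round metrics.
# The scores below are hypothetical; this is an illustration, not the reference implementation.
import numpy as np
from sklearn.linear_model import LogisticRegression

scores = {                      # per-round AUC for each model (e.g., one value per CV split)
    "gbm":    [0.81, 0.83, 0.80, 0.84, 0.82],
    "glmnet": [0.79, 0.80, 0.78, 0.81, 0.80],
    "knn":    [0.74, 0.77, 0.73, 0.75, 0.76],
}
models = list(scores)
col = {m: k for k, m in enumerate(models)}
n_rounds = len(next(iter(scores.values())))

# One duel per ordered pair and round: target w_r(i, j) = 1{s_r(M_i) > s_r(M_j)}.
X, y = [], []
for r in range(n_rounds):
    for i in models:
        for j in models:
            if i == j or scores[i][r] == scores[j][r]:
                continue                                # skip self-duels and ties
            row = np.zeros(len(models))
            row[col[i]], row[col[j]] = 1.0, -1.0        # encodes logit(p_ij) = beta_i - beta_j
            X.append(row)
            y.append(int(scores[i][r] > scores[j][r]))

# No intercept (only differences matter); mild L2 keeps the betas finite under perfect separation.
fit = LogisticRegression(fit_intercept=False, C=10.0).fit(np.array(X), np.array(y))
beta = dict(zip(models, fit.coef_[0] - fit.coef_[0].mean()))   # center so the betas sum to 0

def mpd(a, b):
    """Log-odds MPD and the implied probability that model `a` beats `b` in a random round."""
    delta = beta[a] - beta[b]
    return delta, 1.0 / (1.0 + np.exp(-delta))

print(beta)
print(mpd("gbm", "knn"))
```

Centering the fitted coefficients mirrors the identifiability constraint discussed in Section 6.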

Hardware Resource Scaling (DeLTA framework):

For CNN convolutions on GPUs, layer execution involves interleaved memory and compute streams. Performance $P$ is bounded by the maximum duration among the compute ($t_{CS}$), shared-memory ($t_{SAS}$), load-latency ($t_{GLS}$), and bandwidth ($t_{BW}$) streams. Altering resource $r$ by factor $\alpha$ rescales only the relevant execution times:

$$t_i^{\mathrm{new}} = t_i^{\mathrm{base}} \cdot \begin{cases} 1/\alpha & \text{if stream } i \text{ depends on } r \\ 1 & \text{otherwise} \end{cases}$$

$$\Delta P_r \equiv \frac{P_{\mathrm{new}}}{P_{\mathrm{base}}} = \frac{T_{\mathrm{base}}}{T_{\mathrm{new}}}$$
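
A minimal numeric sketch of this rule, using assumed baseline stream times and an assumed resource-to-stream dependency map (illustrative values only, not measured DeLTA data):

```python
# Hypothetical baseline stream times for one layer (ms) and an assumed mapping from
# hardware resources to the streams they bound.
base = {"compute": 1.4, "shared_mem": 1.0, "load_latency": 0.8, "bandwidth": 2.0}
depends_on = {
    "sm_count": {"compute", "shared_mem"},
    "dram_bw":  {"bandwidth"},
}

def perf_delta(times, resource, alpha):
    """Delta P = T_base / T_new after scaling `resource` by factor alpha."""
    new = {s: (t / alpha if s in depends_on[resource] else t) for s, t in times.items()}
    return max(times.values()) / max(new.values())

print(perf_delta(base, "dram_bw", 2.0))   # bandwidth-bound layer: 2.0 / 1.4 ≈ 1.43
print(perf_delta(base, "sm_count", 2.0))  # compute streams halve, but bandwidth still caps the layer: 1.0
```

Because the layer time is the maximum over streams, scaling a resource that does not bound the layer leaves $\Delta P_r = 1$.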

2. Interval-Scale Interpretation and Probabilistic Semantics

MPD as formalized via EPP scores operates on an interval scale in which differences $\beta_i - \beta_j$ have an invariant probabilistic interpretation across datasets and metrics, overcoming weaknesses of raw score differentials (such as non-comparability and lack of direct meaning). Specifically, a fixed MPD yields the same outperformance probability regardless of dataset specifics:

  • $\Delta \operatorname{EPP}(A,B) = 0.4 \implies P(A>B) = \sigma(0.4) \approx 0.60$
  • Stability is encoded: frequent but narrow wins yield a higher EPP than occasional large wins, even when the average metric values are equal.
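
Concretely, the probability in the first bullet follows from the logistic link defined above:

$$P(A>B) = \sigma(0.4) = \frac{1}{1 + e^{-0.4}} \approx \frac{1}{1.670} \approx 0.60$$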

Conversely, hardware-side MPD via DeLTA provides a closed-form analytic recipe for predicting the performance change when scaling architectural resources, with close agreement to measured performance in empirical cases. For example, doubling the SM count or DRAM bandwidth applies multiplicative factors only to the streams bounded by those resources.

3. Addressing Classical Metric Shortcomings

Traditional metrics (AUC, F₁, ACC, RMSE) are ordinal- or ratio-scale measures whose score differences lack universal meaning and which ignore stability across folds. EPP-derived MPD resolves these issues:

  • Direct probabilistic interpretation: MPD corresponds precisely to the log-odds (logit) of one model outperforming the other.
  • Interval scale: Differences have consistent probabilistic semantics independent of metric or dataset.
  • Cross-validation stability: Each fold constitutes an independent duel; stability across folds is encoded in EPP.
  • Cross-dataset comparability: No dataset-specific offsets or stretching; β parameters are dimensionless.

For hardware systems, DeLTA’s MPD similarly abstracts away metric-specific scale, allowing resource-based compositionality and prediction independent of layer details, provided the bounds are well-characterized.

4. Aggregation, Composition, and Additivity Properties

EPP-based MPD generalizes traditional pairwise comparisons, mapping every score differential into log-odds or an explicit probability, and it composes additively:

  • If $A$ beats $B$ with odds $e^{\Delta_1}$ and $B$ beats $C$ with odds $e^{\Delta_2}$, then the odds that $A$ beats $C$ are $e^{\Delta_1 + \Delta_2}$.
  • Metric-based pairwise differences, by contrast, lack such compositionality.
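
For example, with hypothetical gaps $\Delta_1 = 0.4$ and $\Delta_2 = 0.3$:

$$\text{odds}(A > C) = e^{0.4 + 0.3} = e^{0.7} \approx 2.01, \qquad P(A > C) = \sigma(0.7) \approx 0.67$$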

On the hardware side, when scaling multiple resources simultaneously, one multiplies relevant rescaling factors for each stream, and the resulting MPD is derived by recomputing the maximal execution bound.
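
As a self-contained sketch under the same assumed stream times as in Section 1, applying two simultaneous factors (2× SMs and 2× DRAM bandwidth, both hypothetical):

```python
# Hypothetical stream times (ms) and the net per-stream factor implied by scaling
# two resources at once (2x SMs -> compute and shared_mem; 2x DRAM BW -> bandwidth).
base   = {"compute": 1.4, "shared_mem": 1.0, "load_latency": 0.8, "bandwidth": 2.0}
factor = {"compute": 2.0, "shared_mem": 2.0, "load_latency": 1.0, "bandwidth": 2.0}

new = {s: base[s] / factor[s] for s in base}
print(max(base.values()) / max(new.values()))  # Delta P = 2.0 / 1.0 = 2.0 (bandwidth now sets the bound)
```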

5. Empirical Applications and Case Studies

Predictive Power (EPP):

In the main experiment, four algorithms (GBM, GLMnet, k-NN, Random Forest) × 11 hyperparameter settings × 11 OpenML datasets, each with 20 random splits, generated 9,680 pairwise comparisons; the logistic fit yielded 484 EPP scores (one per model configuration per dataset). Key findings include:

  • Hyperparameter-specific $\beta$ (e.g., $k$-NN values ranging over $\pm 0.4$ yield MPD $= 0.8 \Rightarrow P \approx 0.69$)
  • Algorithm ordering (Random Forest always has $\beta > 0$, i.e., beats the “average” model with $P > 0.5$)
  • Dataset-specific reversals in $\beta$ ordering (illustrated on dataset #334)

GPU Performance (DeLTA):

For ResNet-152 convolutions, scaling from a Titan Xp to a V100 (SMs: $30 \to 84$, BW_MAC: $12.1 \to 14.8$ TFLOP/s, BW_DRAM: $450 \to 850$ GB/s):

  • For compute-bound layers: $\Delta P \approx 1.22$
  • For DRAM-bound layers, $\Delta P$ can reach a factor of $2\times$ after doubling BW_DRAM, but the aggregate speedup reflects the mixture of bounds.
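
The compute-bound figure is consistent with the ratio of the quoted MAC throughputs:

$$\Delta P_{\text{compute}} = \frac{T_{\text{base}}}{T_{\text{new}}} \approx \frac{14.8}{12.1} \approx 1.22$$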

6. Direct Comparability Across Datasets and Architectures

Identifiability is achieved by constraining the EPP scores (e.g., centering so that $\sum_i \beta_i = 0$, or fixing $\beta_1 = 0$). This guarantees that MPD computed on different datasets is universally interpretable, independent of their native metric scales. In hardware settings, MPD admits compositional scaling; the derived speedup applies over heterogeneous layers and workloads, validated by empirical agreement.

7. Summary and Broader Implications

Model Performance Delta represents a principled basis for reporting, comparing, and composing performance differences. In predictive modeling, EPP-derived MPD is superior to raw metric differences, offering direct probabilistic interpretation, interval scaling, encoding of stability, and universality across datasets. In hardware performance modeling, analytic frameworks like DeLTA translate resource scaling into actionable predictions for throughput improvement with validated accuracy. This suggests MPD, grounded in log-odds and probability, is an essential metric for both statistical model selection and architectural design optimization (Gosiewska et al., 2019, Lym et al., 2019).
