Model Performance Delta (MPD)
- Model Performance Delta (MPD) is an interpretable interval-scale metric that maps performance differences between models to probabilities using log-odds.
- It converts raw metrics such as AUC and RMSE into probability-based insights that remain stable across cross-validation folds and compose across comparisons.
- MPD extends to hardware scaling by predicting performance changes from resource augmentation, aiding in efficient architectural design.
Model Performance Delta (MPD) constitutes a rigorously interpretable interval-scale metric for quantifying performance differences between models, with application both in statistical comparison of predictive power and in analytic resource scaling for high-performance hardware systems. MPD subsumes traditional pairwise measures (such as ΔAUC or ΔRMSE) by mapping differences to a probability scale with formal log-odds semantics, and extends to closed-form prediction of performance changes due to architectural resource augmentation.
1. Formal Models for Quantifying Performance Delta
MPD is instantiated differently based on the domain: statistical model comparison or hardware performance scaling.
Predictive Model Setting (EPP framework):
For models evaluated over randomized rounds $k = 1, \dots, K$ (e.g., k-fold splits), a scalar metric (AUC, RMSE, etc.) induces win indicators $w_{ijk} \in \{0, 1\}$ recording whether model $i$ beats model $j$ in round $k$. The empirical probability $\hat{p}_{ij} = \frac{1}{K}\sum_k w_{ijk}$ represents the chance that $i$ outperforms $j$ in a random round. The EPP score $\beta_i$ for model $i$ is determined via logistic regression over all duels: $\operatorname{logit} P(i \text{ beats } j) = \beta_i - \beta_j$.
The MPD between models $i$ and $j$ is then defined as:
- Log-odds MPD: $\mathrm{MPD}(i, j) = \beta_i - \beta_j$
- Probability-based MPD: $\mathrm{MPD}_P(i, j) = \sigma(\beta_i - \beta_j)$, where $\sigma(x) = \frac{1}{1 + e^{-x}}$
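A minimal sketch of this fitting step in plain NumPy, assuming a simple duel table; the names `fit_epp` and `mpd` are illustrative, not the authors' reference implementation:

```python
import numpy as np

def fit_epp(wins, lr=0.01, iters=2000):
    """Fit EPP scores beta from a duel table.

    wins[i, j] = number of rounds in which model i beat model j.
    Model: P(i beats j) = sigmoid(beta_i - beta_j), fitted by plain
    gradient ascent on the log-likelihood (logistic regression over duels).
    """
    beta = np.zeros(wins.shape[0])
    rounds = wins + wins.T                     # rounds played per pair
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(beta[None, :] - beta[:, None]))  # p[i, j] = sigma(beta_i - beta_j)
        beta += lr * (wins - rounds * p).sum(axis=1)             # d log-lik / d beta_i
        beta -= beta.mean()                    # identifiability: centre the scores
    return beta

def mpd(beta, i, j):
    """Log-odds MPD and its probability-scale counterpart."""
    delta = beta[i] - beta[j]
    return delta, 1.0 / (1.0 + np.exp(-delta))

# Toy duel table for three models over 20 rounds per pair:
wins = np.array([[0, 14, 17],
                 [6,  0, 12],
                 [3,  8,  0]])
beta = fit_epp(wins)
print(mpd(beta, 0, 1))  # positive log-odds: model 0 tends to beat model 1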
Hardware Resource Scaling (DeLTA framework):
For CNN convolutions on GPUs, layer execution involves interleaved memory and compute streams. Performance is bounded by the maximum duration among the compute ($t_{\mathrm{comp}}$), shared-memory ($t_{\mathrm{shmem}}$), load-latency ($t_{\mathrm{lat}}$), and DRAM-bandwidth ($t_{\mathrm{bw}}$) streams. Altering resource $r$ by a factor $\alpha$ rescales only the relevant execution times:
$t_i^{\mathrm{new}} = t_i^{\mathrm{base}} \cdot \begin{cases} 1/\alpha & \text{if stream } i \text{ depends on resource } r \\ 1 & \text{otherwise} \end{cases}$
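As a sketch of this rescaling rule (the stream names and dict-based interface are assumptions for illustration, not DeLTA's actual code):

```python
def predicted_layer_time(base_times, scale_factors):
    """DeLTA-style bound recomputation.

    base_times:    dict mapping stream name -> baseline duration, e.g.
                   {"compute": ..., "shmem": ..., "load_lat": ..., "dram_bw": ...}
    scale_factors: dict mapping stream name -> resource scaling factor alpha;
                   streams absent from the dict keep their baseline time.
    """
    new_times = {s: t / scale_factors.get(s, 1.0) for s, t in base_times.items()}
    return max(new_times.values())  # layer time is the maximal stream bound

# Doubling SM count halves only the compute stream's time:
base = {"compute": 1.0, "shmem": 0.6, "load_lat": 0.4, "dram_bw": 0.9}
print(predicted_layer_time(base, {"compute": 2.0}))  # now dram_bw-bound at 0.9
```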
2. Interval-Scale Interpretation and Probabilistic Semantics
MPD as formalized via EPP scores operates on an interval scale wherein differences have an invariant probabilistic interpretation across datasets and metrics, overcoming weaknesses of raw score differentials (such as non-comparability and lack of direct meaning). Specifically, a fixed MPD yields the same outperformance probability, $P = \sigma(\mathrm{MPD})$, regardless of dataset specifics:
- Stability is encoded: frequent but narrow wins yield a higher EPP than occasional large wins, even when the average metric values are equal.
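As a worked instance of this invariance:

$$\mathrm{MPD} = 0.8 \;\Rightarrow\; P(i \text{ beats } j) = \frac{e^{0.8}}{1 + e^{0.8}} \approx 0.69$$

The same probability holds whether the underlying metric was AUC on one dataset or RMSE on another; the same figure reappears in the k-NN case study in Section 5.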
Complementarily, hardware-side MPD via DeLTA provides a closed-form analytic recipe for predicting performance change when scaling architectural resources, with close agreement to measured performance in the reported empirical cases. For example, doubling SM count or DRAM bandwidth applies multiplicative factors only to the streams bounded by those resources.
3. Addressing Classical Metric Shortcomings
Traditional metrics (AUC, F₁, ACC, RMSE) are ordinal- or ratio-scale quantities: their score differences lack universal meaning, and stability across folds is ignored. EPP-derived MPD resolves these issues:
- Direct probability interpretation: MPD corresponds precisely to the log-odds (logit) of one model outperforming the other.
- Interval scale: Differences have consistent probabilistic semantics independent of metric or dataset.
- Cross-validation stability: Each fold constitutes an independent duel; stability across folds is encoded in EPP.
- Cross-dataset comparability: No dataset-specific offsets or stretching; β parameters are dimensionless.
For hardware systems, DeLTA’s MPD similarly abstracts away metric-specific scale, allowing resource-based compositionality and prediction independent of layer details, provided the bounds are well-characterized.
4. Aggregation, Composition, and Additivity Properties
EPP-based MPD generalizes traditional pairwise comparisons, mapping every score differential into log-odds or explicit probability. It presents additive composition:
- For models $A$, $B$, $C$: if $A$ beats $B$ with odds $e^{\beta_A - \beta_B}$ and $B$ beats $C$ with odds $e^{\beta_B - \beta_C}$, then the odds that $A$ beats $C$ are $e^{\beta_A - \beta_B} \cdot e^{\beta_B - \beta_C} = e^{\beta_A - \beta_C}$.
- This contrasts with metric-based pairwise differences, which lack such compositionality.
On the hardware side, when scaling multiple resources simultaneously, one multiplies relevant rescaling factors for each stream, and the resulting MPD is derived by recomputing the maximal execution bound.
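Continuing the hypothetical `predicted_layer_time` sketch above, simultaneous scaling multiplies the per-stream factors before the maximum is retaken, which can shift the binding bound:

```python
# Scaling 2x SMs and 2x DRAM bandwidth at once (illustrative numbers):
base = {"compute": 1.0, "shmem": 0.6, "load_lat": 0.4, "dram_bw": 0.9}
both = {"compute": 2.0, "dram_bw": 2.0}
speedup = predicted_layer_time(base, {}) / predicted_layer_time(base, both)
print(speedup)  # ~1.67x: the layer becomes shmem-bound (0.6), not 2x faster
```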
5. Empirical Applications and Case Studies
Predictive Power (EPP):
In the main experiment, four algorithms (GBM, GLMnet, k-NN, RandomForest) × 11 hyperparameter settings × 11 OpenML datasets, each with 20 random splits, generated 9,680 pairwise comparisons. The logistic fit yielded 484 EPP scores (one per model-dataset pair: 4 × 11 × 11). Key findings include:
- Hyperparameter-specific β (e.g., k-NN β values spanning roughly ±0.4 give MPD = 0.8, i.e., an outperformance probability of ≈0.69; see the check below)
- Algorithm ordering (Random Forest always has β > 0, beating the "average" β = 0 model with probability σ(β) > 0.5)
- Dataset-specific reversals in β ordering (illustrated in dataset #334)
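The first bullet's arithmetic can be checked against the hypothetical `mpd` helper sketched in Section 1:

```python
import numpy as np
beta = np.array([0.4, -0.4])   # the quoted k-NN beta spread
print(mpd(beta, 0, 1))         # (0.8, ~0.69): MPD = 0.8 -> P ≈ 0.69
```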
GPU Performance (DeLTA):
For ResNet-152 convolutions, scaling from a Titan Xp to a V100 (SMs: 30 → 84, BW_MAC: 12.1 → 14.8 TFLOP/s, BW_DRAM: 450 → 850 GB/s):
- Compute-bound layers: the predicted speedup follows the MAC-throughput ratio, 14.8/12.1 ≈ 1.22×.
- DRAM-bound layers: the speedup can approach the bandwidth ratio, 850/450 ≈ 1.89× (nearly a factor of 2 from the near-doubling of BW_DRAM), but aggregate speedup reflects the mixture of bounds, as the sketch below illustrates.
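A back-of-the-envelope check of these figures with the `predicted_layer_time` sketch; the per-stream layer timings below are invented so that each bound is binding:

```python
# Titan Xp -> V100 per-stream factors from the quoted ratios:
factors = {"compute": 14.8 / 12.1, "dram_bw": 850 / 450}
compute_bound = {"compute": 1.0, "shmem": 0.5, "load_lat": 0.3, "dram_bw": 0.6}
dram_bound    = {"compute": 0.6, "shmem": 0.5, "load_lat": 0.3, "dram_bw": 1.0}
for layer in (compute_bound, dram_bound):
    print(predicted_layer_time(layer, {}) / predicted_layer_time(layer, factors))
# -> ~1.22x for the compute-bound layer, ~1.89x for the DRAM-bound one
```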
6. Direct Comparability Across Datasets and Architectures
Identifiability is achieved by centering all EPP scores (e.g., imposing $\sum_i \beta_i = 0$, or fixing a reference model at $\beta = 0$). This guarantees that MPD computed on different datasets is universally interpretable, independent of their native metric scales. In hardware settings, MPD admits compositional scaling: the derived speedup applies across heterogeneous layers and workloads, validated by empirical agreement.
7. Summary and Broader Implications
Model Performance Delta represents a principled basis for reporting, comparing, and composing performance differences. In predictive modeling, EPP-derived MPD is superior to raw metric differences, offering direct probabilistic interpretation, interval scaling, encoding of stability, and universality across datasets. In hardware performance modeling, analytic frameworks like DeLTA translate resource scaling into actionable predictions for throughput improvement with validated accuracy. This suggests MPD, grounded in log-odds and probability, is an essential metric for both statistical model selection and architectural design optimization (Gosiewska et al., 2019, Lym et al., 2019).