Multi-Class Informedness & Markedness Metrics
- Multi-Class Informedness and Markedness are evaluation metrics that generalize binary measures to robustly assess classifier performance in imbalanced multi-class settings.
- They compute per-class informedness (TPR minus FPR) and markedness (Precision plus InversePrecision minus 1) by decomposing the confusion matrix with micro- and macro-averaging.
- Their design ensures symmetric treatment of all classes and offers a chance-corrected, probabilistic interpretation essential for reliable model evaluation.
Multi-class Informedness (ΔP′) and Markedness (ΔP) are evaluation metrics designed to generalize their binary counterparts to the multi-class classification setting. Originating from the need for metrics that are unbiased with respect to base rates and label distributions, these measures quantify the probability that a classifier’s prediction is informed or marked versus chance, overcoming the known biases in accuracy, Recall, Precision, and F-measure when applied naïvely to imbalanced or multi-class tasks. Informedness assesses the probability that the prediction rule is informed about the true label; Markedness assesses the probability that the ground-truth class is marked by the classifier’s output. Both range in $[-1, 1]$, where 0 indicates classification performance at chance level, positive values imply performance better than chance, and negative values indicate performance systematically worse than chance. These measures treat all classes symmetrically and correct for both prevalence (true class distribution) and prediction bias (label distribution) (Powers, 2020).
1. Formal Definition and Multi-class Generalization
Let $C \in \mathbb{N}^{K \times K}$ be the confusion matrix, with $C_{ij}$ denoting the count of true class $i$ predicted as class $j$, and $N = \sum_{i,j} C_{ij}$. For each class $k$:
- $r_k = \sum_j C_{kj}$, the number of true-$k$ instances (row sum)
- $p_k = \sum_i C_{ik}$, the number of predicted-$k$ instances (column sum)
- $TP_k = C_{kk}$, $FP_k = p_k - TP_k$, $FN_k = r_k - TP_k$, $TN_k = N - TP_k - FP_k - FN_k$
Define:
- $TPR_k$ (Recall) $= TP_k / r_k$, with $FPR_k = FP_k / (N - r_k)$
- $PREC_k$ (Precision) $= TP_k / p_k$, with $INVPREC_k = TN_k / (N - p_k)$
Then:
$$B_k = TPR_k - FPR_k, \qquad M_k = PREC_k + INVPREC_k - 1.$$
Aggregate measures:
- Micro-average (prevalence-weighted for $B$, bias-weighted for $M$): $B_{micro} = \sum_k \frac{r_k}{N} B_k$, $M_{micro} = \sum_k \frac{p_k}{N} M_k$
- Macro-average (uniform-weighted): $B_{macro} = \frac{1}{K} \sum_k B_k$, $M_{macro} = \frac{1}{K} \sum_k M_k$
Prevalence-weighting is advocated for $B$ and bias-weighting for $M$ to ensure the probabilistic interpretation (Powers, 2020).
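As a concrete numeric check of these definitions, the sketch below (using a small invented three-class matrix, not one from the source) dichotomizes a single class and computes its $B_k$ and $M_k$ step by step:

```python
import numpy as np

# Hypothetical 3-class confusion matrix: rows = true class, cols = predicted.
C = np.array([[8, 1, 1],
              [2, 6, 2],
              [0, 1, 9]])
N = C.sum()          # 30
k = 0                # dichotomize class 0 (one-vs-rest)

TP = C[k, k]                      # 8
r = C[k, :].sum()                 # true-k instances (row sum): 10
p = C[:, k].sum()                 # predicted-k instances (column sum): 10
FP, FN = p - TP, r - TP           # 2, 2
TN = N - TP - FP - FN             # 18

TPR = TP / r                      # 0.8
FPR = FP / (N - r)                # 2/20 = 0.1
PREC = TP / p                     # 0.8
INVPREC = TN / (N - p)            # 18/20 = 0.9

B_k = TPR - FPR                   # Informedness for class k: 0.7
M_k = PREC + INVPREC - 1          # Markedness for class k: 0.7
print(B_k, M_k)
```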
2. Derivation and Intuitive Interpretation
In the binary setting, Informedness reduces to $TPR - FPR$ (Recall plus Inverse Recall minus 1) and Markedness to $PREC + INVPREC - 1$ (Precision plus Inverse Precision minus 1). For the multi-class case, each class $k$ is dichotomized (one-vs-rest), and Informedness and Markedness are computed using the binary formulas. The aggregate reflects either the expected per-example (micro) or per-class (macro) performance.
- Informedness captures the probability a classifier’s guess is informed beyond chance, correcting for class imbalance and chance-level guesses.
- Markedness reflects the probability that the true label is marked by the prediction, correcting for labeling bias.
Zero corresponds to chance-level behavior; positive or negative values indicate better or worse than chance, respectively. This property is not shared by measures such as raw Recall or Precision, which can produce misleading values under class or label imbalance.
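To make this chance-correction concrete, the simulation below (an assumed setup: imbalanced ground truth and a classifier that guesses labels at random with a fixed bias) shows Recall looking respectable while Informedness stays near zero:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
# Imbalanced ground truth: 80% class 0, 20% class 1.
y_true = rng.choice([0, 1], size=n, p=[0.8, 0.2])
# Uninformed classifier: guesses class 0 with probability 0.9,
# independently of the true label.
y_pred = rng.choice([0, 1], size=n, p=[0.9, 0.1])

# One-vs-rest counts for class 0.
TP = np.sum((y_true == 0) & (y_pred == 0))
FN = np.sum((y_true == 0) & (y_pred == 1))
FP = np.sum((y_true == 1) & (y_pred == 0))
TN = np.sum((y_true == 1) & (y_pred == 1))

recall = TP / (TP + FN)                     # ~0.9: looks strong
informedness = TP/(TP+FN) - FP/(FP+TN)      # ~0.0: chance level
print(recall, informedness)
```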
3. Relationships with Other Metrics
In the binary scenario:
- Matthews Correlation/Pearson’s $\phi$: $\phi = \mathrm{sgn}(B)\sqrt{B \cdot M}$, the signed geometric mean of Informedness and Markedness
- Area under the ROC curve (AUC) for a single operating point: $AUC = (B + 1)/2$
- F1 Score: $F_1 = \frac{2 \cdot PREC \cdot TPR}{PREC + TPR}$, which can be related to $B$, Bias, and Prevalence.
For the multi-class case, a correlation measure can be defined analogously as the geometric mean $\pm\sqrt{B \cdot M}$. Other contingency-matrix based measures (e.g., Cramér’s $V$) are also possible. However, $B$ and $M$ are sufficient to provide a two-dimensional summary of classification behavior (Powers, 2020).
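A minimal numeric check of the binary relationship $\phi = \mathrm{sgn}(B)\sqrt{B \cdot M}$, using a hypothetical 2×2 contingency table:

```python
import numpy as np

# Hypothetical binary confusion counts.
TP, FN, FP, TN = 40, 10, 5, 45

B = TP/(TP+FN) - FP/(FP+TN)          # Informedness: 0.8 - 0.1 = 0.7
M = TP/(TP+FP) + TN/(TN+FN) - 1      # Markedness: ~0.707

# Matthews correlation from its closed form.
mcc = (TP*TN - FP*FN) / np.sqrt((TP+FP)*(TP+FN)*(TN+FP)*(TN+FN))

print(np.sign(B) * np.sqrt(B * M))   # ~0.7035
print(mcc)                           # ~0.7035 (agrees)
```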
4. Worked Example
Consider a three-class confusion matrix $C$ with total count $N$, row sums $r_k$, and column sums $p_k$. Dichotomizing each class yields the following per-class Informedness and Markedness:
| Class ($k$) | Informedness | Markedness |
|---|---|---|
| 1 | 0.661 | 0.631 |
| 2 | 0.550 | 0.542 |
| 3 | 0.615 | 0.657 |
Micro-averages: $B_{micro} = \sum_k \frac{r_k}{N} B_k$ and $M_{micro} = \sum_k \frac{p_k}{N} M_k$, weighted by the row and column sums of $C$.
Macro-averages: $B_{macro} \approx 0.609$ and $M_{macro} \approx 0.610$, the unweighted means of the per-class values above.
These values quantify, on the $[-1, 1]$ scale, how much better the classifier performs compared to chance, accounting for both row and column distributions (Powers, 2020).
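The macro-averages follow directly from the per-class values in the table; the micro-averages additionally require the row and column sums. A quick arithmetic check of the macro figures:

```python
import numpy as np

B_i = np.array([0.661, 0.550, 0.615])  # per-class Informedness from the table
M_i = np.array([0.631, 0.542, 0.657])  # per-class Markedness from the table

print(B_i.mean())  # B_macro ≈ 0.609
print(M_i.mean())  # M_macro ≈ 0.610
```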
5. Algorithmic Computation
Efficient computation of multi-class Informedness and Markedness is based directly on the confusion matrix. The routine below operates in $O(K^2)$ time for $K$ classes:
```python
import numpy as np

def multiDeltaP(C):
    """Compute micro- and macro-averaged Informedness (B) and Markedness (M)
    from a K x K confusion matrix C (rows = true class, cols = predicted)."""
    C = np.asarray(C)
    N = np.sum(C)
    K = len(C)
    B_i = np.zeros(K)        # per-class Informedness
    M_i = np.zeros(K)        # per-class Markedness
    weight_r = np.zeros(K)   # prevalence weights r_k / N
    weight_p = np.zeros(K)   # bias weights p_k / N
    for i in range(K):
        TP = C[i][i]
        r = np.sum(C[i, :])  # true-i instances (row sum)
        p = np.sum(C[:, i])  # predicted-i instances (column sum)
        FP = p - TP
        FN = r - TP
        TN = N - TP - FP - FN
        # Guard against empty rows/columns to avoid division by zero.
        TPR = TP / r if r > 0 else 0
        FPR = FP / (N - r) if (N - r) > 0 else 0
        PREC = TP / p if p > 0 else 0
        INVPREC = TN / (N - p) if (N - p) > 0 else 0
        B_i[i] = TPR - FPR            # Informedness for class i
        M_i[i] = PREC + INVPREC - 1   # Markedness for class i
        weight_r[i] = r / N
        weight_p[i] = p / N
    B_micro = np.sum(weight_r * B_i)
    M_micro = np.sum(weight_p * M_i)
    B_macro = np.mean(B_i)
    M_macro = np.mean(M_i)
    return B_micro, M_micro, B_macro, M_macro
```
This algorithm operates by class, dichotomizing each class to compute per-class Informedness and Markedness, and then aggregating using either micro- or macro-averaging.
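For illustration, a hypothetical call to the function above (the matrix here is invented for the example, not the one from Section 4):

```python
import numpy as np

C = np.array([[50,  3,  2],
              [ 4, 40,  6],
              [ 1,  5, 44]])

B_micro, M_micro, B_macro, M_macro = multiDeltaP(C)
print(f"Informedness: micro={B_micro:.3f}, macro={B_macro:.3f}")
print(f"Markedness:   micro={M_micro:.3f}, macro={M_macro:.3f}")
```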
6. Interpretation, Implications, and Use Cases
$B_{micro}$ provides the expected informedness for a random true sample, while $M_{micro}$ gives the expected markedness for a random predicted sample. Both measures are interpretable on the $[-1, 1]$ interval, enabling meaningful direct comparison across datasets of varying imbalance.
These statistics correct for chance-level behavior induced by class imbalance and prediction bias, treating all classes symmetrically. Macro-averaging may be preferred when each class is to be weighted equally, regardless of size. Both decompositions are useful for performance auditing and class-level error analysis in multi-class tasks.
Their closed-form relationship to ROC-AUC, F-measure, and correlation in the dichotomous case extends analytical insights, while their multi-class generalization preserves probabilistic interpretation—a distinguishing property among multi-class evaluation measures (Powers, 2020).
7. Context within Broader Evaluation Frameworks
Multi-class Informedness and Markedness are designed to address the limitation that standard metrics such as accuracy, Recall, Precision, and F-measure can be misleading under class or label imbalance, often inflating apparent performance in uninformative models. Unlike these conventional statistics, Informedness and Markedness have zero-value baselines at random guessing and robustly penalize both Type I and Type II errors, regardless of class prevalence.
A plausible implication is that their adoption may improve the robustness of model selection protocols and fairness in quantitative model benchmarking across datasets with heterogeneous class distributions. Their explicit correction for base-rate effects makes them particularly suitable for research requiring reliable inter-dataset or inter-model comparisons, especially in multiclass settings common in real-world machine learning applications.