Multi-Class Informedness & Markedness Metrics

Updated 21 December 2025
  • Multi-Class Informedness and Markedness are evaluation metrics that generalize binary measures to robustly assess classifier performance in imbalanced multi-class settings.
  • Per-class informedness (TPR minus FPR) and markedness (Precision plus InversePrecision minus 1) are computed from a one-vs-rest decomposition of the confusion matrix and then combined by micro- or macro-averaging.
  • Their design ensures symmetric treatment of all classes and offers a chance-corrected, probabilistic interpretation essential for reliable model evaluation.

Multi-class Informedness ($\Delta P'$) and Markedness ($\Delta P$) are evaluation metrics designed to generalize their binary counterparts to the multi-class classification setting. Originating from the need for metrics that are unbiased with respect to base rates and label distributions, these measures quantify the probability that a classifier’s prediction is informed or marked versus chance, overcoming the known biases in accuracy, Recall, Precision, and F-measure when applied naïvely to imbalanced or multi-class tasks. Informedness assesses the probability that the prediction rule is informed about the true label; Markedness assesses the probability that the ground-truth class is marked by the classifier’s output. Both range over $[-1, +1]$, where 0 indicates classification performance at chance level, positive values imply performance better than chance, and negative values indicate performance systematically worse than chance. These measures treat all $K$ classes symmetrically and correct for both prevalence (true class distribution) and prediction bias (label distribution) (Powers, 2020).

1. Formal Definition and Multi-class Generalization

Let $C$ be the $K \times K$ confusion matrix, with $C_{ij}$ denoting the count of true class $i$ predicted as class $j$, and $N = \sum_{i=1}^K \sum_{j=1}^K C_{ij}$. For each class $i$:

  • $r_i = \sum_j C_{ij}$, the number of true-$i$ instances
  • $p_i = \sum_j C_{ji}$, the number of predicted-$i$ instances
  • $TP_i = C_{ii}$, $FP_i = \sum_{j \neq i} C_{ji}$, $FN_i = \sum_{j \neq i} C_{ij}$, $TN_i = N - TP_i - FP_i - FN_i$

Define:

  • $TPR_i$ (Recall$_i$) $= TP_i / r_i$
  • $FPR_i = FP_i / (N - r_i)$
  • $PREC_i$ (Precision$_i$) $= TP_i / p_i$
  • $INVPREC_i = TN_i / (N - p_i)$

Then:

  • $\text{Informedness}_i = TPR_i - FPR_i$
  • $\text{Markedness}_i = PREC_i + INVPREC_i - 1$

Aggregate measures:

  • Micro-average (prevalence-weighted): $B_{\text{micro}} = \sum_{i=1}^K \frac{r_i}{N} \cdot \text{Informedness}_i$, $M_{\text{micro}} = \sum_{i=1}^K \frac{p_i}{N} \cdot \text{Markedness}_i$
  • Macro-average (uniform-weighted): $B_{\text{macro}} = \frac{1}{K}\sum_{i=1}^K \text{Informedness}_i$, $M_{\text{macro}} = \frac{1}{K}\sum_{i=1}^K \text{Markedness}_i$

Prevalence-weighting is advocated for $B_{\text{micro}}$ and bias-weighting for $M_{\text{micro}}$ to ensure probabilistic interpretation (Powers, 2020).
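
The definitions in this section map directly onto array operations. The following is a minimal vectorized sketch, not a reference implementation: it uses the confusion matrix from the worked example in Section 4, and it assumes every class occurs and is predicted at least once (and that no class covers the whole dataset), so no denominator is zero.

```python
import numpy as np

# Confusion matrix from the worked example in Section 4
# (rows = true class, columns = predicted class).
C = np.array([[10, 2, 1],
              [3, 15, 2],
              [1, 4, 12]], dtype=float)

N = C.sum()
r = C.sum(axis=1)   # r_i: true-class counts (row sums)
p = C.sum(axis=0)   # p_i: predicted-class counts (column sums)
TP = np.diag(C)
FP = p - TP
FN = r - TP
TN = N - TP - FP - FN

# Per-class one-vs-rest metrics (assumes all denominators are nonzero).
informedness = TP / r - FP / (N - r)      # TPR_i - FPR_i
markedness = TP / p + TN / (N - p) - 1    # PREC_i + INVPREC_i - 1

B_micro = np.sum(r / N * informedness)    # prevalence-weighted
M_micro = np.sum(p / N * markedness)      # bias-weighted
B_macro = informedness.mean()
M_macro = markedness.mean()

print(np.round(informedness, 3), np.round(markedness, 3))
print(round(B_micro, 3), round(M_micro, 3), round(B_macro, 3), round(M_macro, 3))
```

Computing `FP` and `FN` as margins minus the diagonal avoids an explicit loop over off-diagonal cells; a production version would also need to guard the divisions for empty rows or columns.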

2. Derivation and Intuitive Interpretation

In the binary setting, Informedness reduces to $TPR - FPR$ and Markedness to $Precision + InversePrecision - 1$. For the multi-class case, each class $i$ is dichotomized (one-vs-rest); Informedness$_i$ and Markedness$_i$ are computed using the binary formulas. The aggregate reflects either the expected per-example (micro) or per-class (macro) performance.

  • Informedness captures the probability a classifier’s guess is informed beyond chance, correcting for class imbalance and chance-level guesses.
  • Markedness reflects the probability that the true label is marked by the prediction, correcting for labeling bias.

Zero corresponds to chance-level behavior; positive or negative values indicate better or worse than chance, respectively. This property is not shared by measures such as raw Recall or Precision, which can produce misleading values under class or label imbalance.
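
The contrast with raw accuracy and Recall can be illustrated on a small, entirely hypothetical binary test set with 90 negatives and 10 positives. The sketch below scores two degenerate classifiers that ignore their input; the helper name `binary_informedness` is an assumption made for this illustration.

```python
def binary_informedness(tp, fp, fn, tn):
    """TPR - FPR for a 2x2 table; assumes both true classes are present."""
    return tp / (tp + fn) - fp / (fp + tn)

# Classifier A predicts "negative" for every example.
tp, fp, fn, tn = 0, 0, 10, 90
accuracy_a = (tp + tn) / (tp + fp + fn + tn)
informedness_a = binary_informedness(tp, fp, fn, tn)

# Classifier B predicts "positive" for every example.
tp, fp, fn, tn = 10, 90, 0, 0
recall_b = tp / (tp + fn)
informedness_b = binary_informedness(tp, fp, fn, tn)

print(accuracy_a, informedness_a)   # 0.9 0.0
print(recall_b, informedness_b)     # 1.0 0.0
```

Both classifiers carry no information about the true label, and Informedness reports exactly 0 for each, while accuracy (0.9) and Recall (1.0) look strong. Markedness is left out here because Precision or InversePrecision is undefined when one label is never predicted.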

3. Relationships with Other Metrics

In the binary scenario:

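  • Matthews Correlation/Pearson’s $\rho$: $\rho = \operatorname{sign}(\Delta P') \sqrt{\Delta P' \cdot \Delta P}$. In a $2 \times 2$ table both measures take the sign of $TP \cdot TN - FP \cdot FN$, so their product is never negative and the sign of the correlation has to come from either measure rather than from the product.
  • Area under the ROC curve (AUC): $AUC = (\Delta P' + 1)/2$ for the ROC curve defined by a single operating point.
  • F1 Score: $F_1 = 2 \cdot Precision \cdot Recall / (Precision + Recall)$, which can be related to $\Delta P'$, bias, and prevalence.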
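
The first two binary relationships can be checked numerically. The sketch below uses hypothetical counts (one better-than-chance table and its prediction-flipped counterpart) and a helper named `binary_summary` introduced only for this illustration. It takes the sign of the correlation from Informedness, and it computes the single-point AUC as the trapezoid area under the ROC polyline (0,0) → (FPR,TPR) → (1,1).

```python
import math

def binary_summary(tp, fp, fn, tn):
    """Return (informedness, markedness, mcc) for a 2x2 table with nonzero margins."""
    informedness = tp / (tp + fn) - fp / (fp + tn)      # TPR - FPR
    markedness = tp / (tp + fp) + tn / (tn + fn) - 1    # Precision + InversePrecision - 1
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return informedness, markedness, mcc

# Hypothetical counts: a better-than-chance table, then the same data
# with every prediction flipped (worse than chance).
for tp, fp, fn, tn in [(40, 10, 20, 30), (20, 30, 40, 10)]:
    b, m, mcc = binary_summary(tp, fp, fn, tn)
    # Signed geometric mean; the sign is taken from Informedness.
    rho = math.copysign(math.sqrt(abs(b * m)), b)
    tpr, fpr = tp / (tp + fn), fp / (fp + tn)
    # Trapezoid area under the ROC polyline (0,0) -> (FPR,TPR) -> (1,1).
    auc = 0.5 * fpr * tpr + 0.5 * (1 - fpr) * (tpr + 1)
    print(math.isclose(mcc, rho), math.isclose(auc, (b + 1) / 2))
```

For the flipped table both Informedness and Markedness are negative, so their product is positive; this is why the sign is read from one of the two measures.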

For the multi-class case, a correlation measure can be defined as $\rho_{\text{multi}} = \operatorname{sign}(B_{\text{micro}} \cdot M_{\text{micro}}) \sqrt{|B_{\text{micro}} \cdot M_{\text{micro}}|}$. Other contingency-matrix-based measures (e.g., Cramér’s $V$) are also possible. However, $B_{\text{micro}}$ and $M_{\text{micro}}$ are sufficient to provide a two-dimensional summary of classification behavior (Powers, 2020).

4. Worked Example

Given a $3 \times 3$ confusion matrix $C$:

$$C = \begin{bmatrix}10 & 2 & 1 \\ 3 & 15 & 2 \\ 1 & 4 & 12\end{bmatrix}$$

With $N = 50$, row sums $r = [13, 20, 17]$, and column sums $p = [14, 21, 15]$, the following per-class Informedness and Markedness are obtained:

| Class ($i$) | Informedness$_i$ | Markedness$_i$ |
|---|---|---|
| 1 | 0.661 | 0.631 |
| 2 | 0.550 | 0.542 |
| 3 | 0.615 | 0.657 |

Micro-averages:

  • $B_{\text{micro}} \approx 0.601$
  • $M_{\text{micro}} \approx 0.601$

Macro-averages:

  • $B_{\text{macro}} \approx 0.609$
  • $M_{\text{macro}} \approx 0.610$

These values quantify, on the $[-1, 1]$ scale, how much better the classifier performs compared to chance, accounting for both row and column distributions (Powers, 2020).
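
As a check on the figures above, the class-1 entries and the two weighted averages can be traced back to the counts. The arithmetic below uses only the matrix, row sums, and column sums already given; intermediate values are rounded to three decimals.

```latex
\begin{aligned}
TP_1 &= 10, \quad FP_1 = 14 - 10 = 4, \quad FN_1 = 13 - 10 = 3, \quad TN_1 = 50 - 10 - 4 - 3 = 33 \\
\text{Informedness}_1 &= \frac{10}{13} - \frac{4}{37} \approx 0.769 - 0.108 = 0.661 \\
\text{Markedness}_1 &= \frac{10}{14} + \frac{33}{36} - 1 \approx 0.714 + 0.917 - 1 = 0.631 \\
B_{\text{micro}} &\approx \tfrac{13}{50}(0.661) + \tfrac{20}{50}(0.550) + \tfrac{17}{50}(0.615) \approx 0.601 \\
M_{\text{micro}} &\approx \tfrac{14}{50}(0.631) + \tfrac{21}{50}(0.542) + \tfrac{15}{50}(0.657) \approx 0.601
\end{aligned}
```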

5. Algorithmic Computation

Efficient computation of multi-class Informedness and Markedness is based directly on the confusion matrix. The routine operates in $O(K^2)$ time for $K$ classes:

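import numpy as np

def multiDeltaP(C):
    """Return (B_micro, M_micro, B_macro, M_macro) for a K x K confusion matrix.

    C[i][j] is the count of true class i predicted as class j.
    """
    C = np.asarray(C, dtype=float)  # accept nested lists as well as arrays
    N = np.sum(C)
    K = C.shape[0]
    B_i = np.zeros(K)       # per-class Informedness
    M_i = np.zeros(K)       # per-class Markedness
    weight_r = np.zeros(K)  # prevalence weights r_i / N
    weight_p = np.zeros(K)  # bias weights p_i / N
    for i in range(K):
        # One-vs-rest counts for class i
        TP = C[i, i]
        r = np.sum(C[i, :])  # true-i instances (row sum)
        p = np.sum(C[:, i])  # predicted-i instances (column sum)
        FP = p - TP
        FN = r - TP
        TN = N - TP - FP - FN
        # A ratio with an empty denominator is set to 0
        TPR = TP / r if r > 0 else 0
        FPR = FP / (N - r) if (N - r) > 0 else 0
        PREC = TP / p if p > 0 else 0
        INVPREC = TN / (N - p) if (N - p) > 0 else 0
        B_i[i] = TPR - FPR
        M_i[i] = PREC + INVPREC - 1
        weight_r[i] = r / N
        weight_p[i] = p / N
    B_micro = np.sum(weight_r * B_i)  # prevalence-weighted Informedness
    M_micro = np.sum(weight_p * M_i)  # bias-weighted Markedness
    B_macro = np.mean(B_i)
    M_macro = np.mean(M_i)
    return B_micro, M_micro, B_macro, M_macro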

This algorithm operates by class, dichotomizing each class to compute per-class Informedness and Markedness, and then aggregating using either micro- or macro-averaging.
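
The routine takes a confusion matrix as input, so label vectors have to be tabulated first. The following sketch shows one way to do that with NumPy, keeping the convention used throughout (rows are true classes, columns are predicted classes). The helper name `confusion_matrix_from_labels` and the toy labels are assumptions for illustration, and labels are assumed to be integers in the range 0 to K-1.

```python
import numpy as np

def confusion_matrix_from_labels(y_true, y_pred, num_classes):
    """Count matrix with rows = true class, columns = predicted class.

    Labels are assumed to be integers in [0, num_classes).
    """
    C = np.zeros((num_classes, num_classes), dtype=np.int64)
    # np.add.at accumulates correctly when the same (i, j) pair repeats,
    # unlike plain fancy-index assignment.
    np.add.at(C, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return C

# Hypothetical toy labels for three classes.
y_true = [0, 0, 1, 1, 1, 2, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 2, 0, 2]
C = confusion_matrix_from_labels(y_true, y_pred, num_classes=3)
print(C)
```

The resulting array can be passed to `multiDeltaP` as defined above.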

6. Interpretation, Implications, and Use Cases

$B_{\text{micro}}$ provides the expected informedness for a random true sample, while $M_{\text{micro}}$ gives the expected markedness for a random predicted sample. Both measures are interpretable on the $[-1, 1]$ interval, enabling meaningful direct comparison across datasets of varying imbalance.

These statistics correct for chance-level behavior induced by class imbalance and prediction bias, treating all $K$ classes symmetrically. Macro-averaging may be preferred when each class is to be weighted equally, regardless of size. Both decompositions are useful for performance auditing and class-level error analysis in multi-class tasks.

Their closed-form relationship to ROC-AUC, F-measure, and correlation in the dichotomous case extends analytical insights, while their multi-class generalization preserves probabilistic interpretation—a distinguishing property among multi-class evaluation measures (Powers, 2020).

7. Context within Broader Evaluation Frameworks

Multi-class Informedness and Markedness are designed to address the limitation that standard metrics such as accuracy, Recall, Precision, and F-measure can be misleading under class or label imbalance, often inflating apparent performance in uninformative models. Unlike these conventional statistics, Informedness and Markedness have zero-value baselines at random guessing and robustly penalize both Type I and Type II errors, regardless of class prevalence.

A plausible implication is that their adoption may improve the robustness of model selection protocols and fairness in quantitative model benchmarking across datasets with heterogeneous class distributions. Their explicit correction for base-rate effects makes them particularly suitable for research requiring reliable inter-dataset or inter-model comparisons, especially in multiclass settings common in real-world machine learning applications.
