Multi-Class Informedness & Markedness Metrics
- Multi-Class Informedness and Markedness are evaluation metrics that generalize binary measures to robustly assess classifier performance in imbalanced multi-class settings.
- They compute per-class informedness (TPR minus FPR) and markedness (Precision plus InversePrecision minus 1) by decomposing the confusion matrix with micro- and macro-averaging.
- Their design ensures symmetric treatment of all classes and offers a chance-corrected, probabilistic interpretation essential for reliable model evaluation.
Multi-class Informedness (ΔP′) and Markedness (ΔP) are evaluation metrics designed to generalize their binary counterparts to the multi-class classification setting. Originating from the need for metrics that are unbiased with respect to base rates and label distributions, these measures quantify the probability that a classifier’s prediction is informed or marked versus chance, overcoming the known biases in accuracy, Recall, Precision, and F-measure when applied naïvely to imbalanced or multi-class tasks. Informedness assesses the probability that the prediction rule is informed about the true label; Markedness assesses the probability that the ground-truth class is marked by the classifier’s output. Both range in $[-1, 1]$, where 0 indicates classification performance at chance level, positive values imply performance better than chance, and negative values indicate performance systematically worse than chance. These measures treat all classes symmetrically and correct for both prevalence (true class distribution) and prediction bias (label distribution) (Powers, 2020).
1. Formal Definition and Multi-class Generalization
Let $C \in \mathbb{N}^{K \times K}$ be the confusion matrix, with $C_{ij}$ denoting the count of true class $i$ predicted as class $j$, and $N = \sum_{i,j} C_{ij}$. For each class $k$:
- $r_k = \sum_j C_{kj}$, the number of true-$k$ instances (row sum)
- $p_k = \sum_i C_{ik}$, the number of predicted-$k$ instances (column sum)
- $TP_k = C_{kk}$, $FP_k = p_k - TP_k$, $FN_k = r_k - TP_k$, $TN_k = N - TP_k - FP_k - FN_k$
Define:
- $TPR_k$ (Recall) $= TP_k / r_k$, with $FPR_k = FP_k / (N - r_k)$
- $PREC_k$ (Precision) $= TP_k / p_k$, with $INVPREC_k = TN_k / (N - p_k)$
Then:
$$B_k = TPR_k - FPR_k, \qquad M_k = PREC_k + INVPREC_k - 1.$$
Aggregate measures:
- Micro-average (prevalence-weighted for $B$, bias-weighted for $M$): $B_{micro} = \sum_k \frac{r_k}{N} B_k$, $M_{micro} = \sum_k \frac{p_k}{N} M_k$
- Macro-average (uniform-weighted): $B_{macro} = \frac{1}{K} \sum_k B_k$, $M_{macro} = \frac{1}{K} \sum_k M_k$
Prevalence-weighting is advocated for $B$ and bias-weighting for $M$ to ensure the probabilistic interpretation (Powers, 2020).
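As a concrete numeric check of these definitions, the sketch below (using a small invented three-class matrix, not one from the source) dichotomizes a single class and computes its $B_k$ and $M_k$ step by step:

```python
import numpy as np

# Hypothetical 3-class confusion matrix: rows = true class, cols = predicted.
C = np.array([[8, 1, 1],
              [2, 6, 2],
              [0, 1, 9]])
N = C.sum()          # 30
k = 0                # dichotomize class 0 (one-vs-rest)

TP = C[k, k]                      # 8
r = C[k, :].sum()                 # true-k instances (row sum): 10
p = C[:, k].sum()                 # predicted-k instances (column sum): 10
FP, FN = p - TP, r - TP           # 2, 2
TN = N - TP - FP - FN             # 18

TPR = TP / r                      # 0.8
FPR = FP / (N - r)                # 2/20 = 0.1
PREC = TP / p                     # 0.8
INVPREC = TN / (N - p)            # 18/20 = 0.9

B_k = TPR - FPR                   # Informedness for class k: 0.7
M_k = PREC + INVPREC - 1          # Markedness for class k: 0.7
print(B_k, M_k)
```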
2. Derivation and Intuitive Interpretation
In the binary setting, Informedness reduces to $TPR - FPR$ (Recall plus Inverse Recall minus 1) and Markedness to $PREC + INVPREC - 1$ (Precision plus Inverse Precision minus 1). For the multi-class case, each class $k$ is dichotomized (one-vs-rest), and Informedness and Markedness are computed using the binary formulas. The aggregate reflects either the expected per-example (micro) or per-class (macro) performance.
- Informedness captures the probability a classifier’s guess is informed beyond chance, correcting for class imbalance and chance-level guesses.
- Markedness reflects the probability that the true label is marked by the prediction, correcting for labeling bias.
Zero corresponds to chance-level behavior; positive or negative values indicate better or worse than chance, respectively. This property is not shared by measures such as raw Recall or Precision, which can produce misleading values under class or label imbalance.
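To make this chance-correction concrete, the simulation below (an assumed setup: imbalanced ground truth and a classifier that guesses labels at random with a fixed bias) shows Recall looking respectable while Informedness stays near zero:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
# Imbalanced ground truth: 80% class 0, 20% class 1.
y_true = rng.choice([0, 1], size=n, p=[0.8, 0.2])
# Uninformed classifier: guesses class 0 with probability 0.9,
# independently of the true label.
y_pred = rng.choice([0, 1], size=n, p=[0.9, 0.1])

# One-vs-rest counts for class 0.
TP = np.sum((y_true == 0) & (y_pred == 0))
FN = np.sum((y_true == 0) & (y_pred == 1))
FP = np.sum((y_true == 1) & (y_pred == 0))
TN = np.sum((y_true == 1) & (y_pred == 1))

recall = TP / (TP + FN)                     # ~0.9: looks strong
informedness = TP/(TP+FN) - FP/(FP+TN)      # ~0.0: chance level
print(recall, informedness)
```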
3. Relationships with Other Metrics
In the binary scenario:
- Matthews Correlation/Pearson’s $\phi$: $\phi = \mathrm{sgn}(B)\sqrt{B \cdot M}$, the signed geometric mean of Informedness and Markedness
- Area under the ROC curve (AUC) for a single operating point: $AUC = (B + 1)/2$
- F1 Score: $F_1 = \frac{2 \cdot PREC \cdot TPR}{PREC + TPR}$, which can be related to $B$, Bias, and Prevalence.
For the multi-class case, a correlation measure can be defined analogously as the geometric mean $\pm\sqrt{B \cdot M}$. Other contingency-matrix based measures (e.g., Cramér’s $V$) are also possible. However, $B$ and $M$ are sufficient to provide a two-dimensional summary of classification behavior (Powers, 2020).
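A minimal numeric check of the binary relationship $\phi = \mathrm{sgn}(B)\sqrt{B \cdot M}$, using a hypothetical 2×2 contingency table:

```python
import numpy as np

# Hypothetical binary confusion counts.
TP, FN, FP, TN = 40, 10, 5, 45

B = TP/(TP+FN) - FP/(FP+TN)          # Informedness: 0.8 - 0.1 = 0.7
M = TP/(TP+FP) + TN/(TN+FN) - 1      # Markedness: ~0.707

# Matthews correlation from its closed form.
mcc = (TP*TN - FP*FN) / np.sqrt((TP+FP)*(TP+FN)*(TN+FP)*(TN+FN))

print(np.sign(B) * np.sqrt(B * M))   # ~0.7035
print(mcc)                           # ~0.7035 (agrees)
```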
4. Worked Example
Consider a three-class confusion matrix $C$ with total count $N$, row sums $r_k$, and column sums $p_k$. Dichotomizing each class yields the following per-class Informedness and Markedness:
| Class ($k$) | Informedness | Markedness |
|---|---|---|
| 1 | 0.661 | 0.631 |
| 2 | 0.550 | 0.542 |
| 3 | 0.615 | 0.657 |
Micro-averages: $B_{micro} = \sum_k \frac{r_k}{N} B_k$ and $M_{micro} = \sum_k \frac{p_k}{N} M_k$, weighted by the row and column sums of $C$.
Macro-averages: $B_{macro} \approx 0.609$ and $M_{macro} \approx 0.610$, the unweighted means of the per-class values above.
These values quantify, on the $[-1, 1]$ scale, how much better the classifier performs compared to chance, accounting for both row and column distributions (Powers, 2020).
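The macro-averages follow directly from the per-class values in the table; the micro-averages additionally require the row and column sums. A quick arithmetic check of the macro figures:

```python
import numpy as np

B_i = np.array([0.661, 0.550, 0.615])  # per-class Informedness from the table
M_i = np.array([0.631, 0.542, 0.657])  # per-class Markedness from the table

print(B_i.mean())  # B_macro ≈ 0.609
print(M_i.mean())  # M_macro ≈ 0.610
```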
5. Algorithmic Computation
Efficient computation of multi-class Informedness and Markedness is based directly on the confusion matrix. The routine below operates in $O(K^2)$ time for $K$ classes:
```python
import numpy as np

def multiDeltaP(C):
    """Compute micro- and macro-averaged Informedness (B) and Markedness (M)
    from a K x K confusion matrix C (rows = true class, cols = predicted)."""
    C = np.asarray(C)
    N = np.sum(C)
    K = len(C)
    B_i = np.zeros(K)        # per-class Informedness
    M_i = np.zeros(K)        # per-class Markedness
    weight_r = np.zeros(K)   # prevalence weights r_k / N
    weight_p = np.zeros(K)   # bias weights p_k / N
    for i in range(K):
        TP = C[i][i]
        r = np.sum(C[i, :])  # true-i instances (row sum)
        p = np.sum(C[:, i])  # predicted-i instances (column sum)
        FP = p - TP
        FN = r - TP
        TN = N - TP - FP - FN
        # Guard against empty rows/columns to avoid division by zero.
        TPR = TP / r if r > 0 else 0
        FPR = FP / (N - r) if (N - r) > 0 else 0
        PREC = TP / p if p > 0 else 0
        INVPREC = TN / (N - p) if (N - p) > 0 else 0
        B_i[i] = TPR - FPR            # Informedness for class i
        M_i[i] = PREC + INVPREC - 1   # Markedness for class i
        weight_r[i] = r / N
        weight_p[i] = p / N
    B_micro = np.sum(weight_r * B_i)
    M_micro = np.sum(weight_p * M_i)
    B_macro = np.mean(B_i)
    M_macro = np.mean(M_i)
    return B_micro, M_micro, B_macro, M_macro
```
This algorithm operates by class, dichotomizing each class to compute per-class Informedness and Markedness, and then aggregating using either micro- or macro-averaging.
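For illustration, a hypothetical call to the function above (the matrix here is invented for the example, not the one from Section 4):

```python
import numpy as np

C = np.array([[50,  3,  2],
              [ 4, 40,  6],
              [ 1,  5, 44]])

B_micro, M_micro, B_macro, M_macro = multiDeltaP(C)
print(f"Informedness: micro={B_micro:.3f}, macro={B_macro:.3f}")
print(f"Markedness:   micro={M_micro:.3f}, macro={M_macro:.3f}")
```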
6. Interpretation, Implications, and Use Cases
$B_{micro}$ provides the expected informedness for a random true sample, while $M_{micro}$ gives the expected markedness for a random predicted sample. Both measures are interpretable on the $[-1, 1]$ interval, enabling meaningful direct comparison across datasets of varying imbalance.
These statistics correct for chance-level behavior induced by class imbalance and prediction bias, treating all classes symmetrically. Macro-averaging may be preferred when each class is to be weighted equally, regardless of size. Both decompositions are useful for performance auditing and class-level error analysis in multi-class tasks.
Their closed-form relationship to ROC-AUC, F-measure, and correlation in the dichotomous case extends analytical insights, while their multi-class generalization preserves probabilistic interpretation—a distinguishing property among multi-class evaluation measures (Powers, 2020).
7. Context within Broader Evaluation Frameworks
Multi-class Informedness and Markedness are designed to address the limitation that standard metrics such as accuracy, Recall, Precision, and F-measure can be misleading under class or label imbalance, often inflating apparent performance in uninformative models. Unlike these conventional statistics, Informedness and Markedness have zero-value baselines at random guessing and robustly penalize both Type I and Type II errors, regardless of class prevalence.
A plausible implication is that their adoption may improve the robustness of model selection protocols and fairness in quantitative model benchmarking across datasets with heterogeneous class distributions. Their explicit correction for base-rate effects makes them particularly suitable for research requiring reliable inter-dataset or inter-model comparisons, especially in multiclass settings common in real-world machine learning applications.