Matthews Correlation Coefficient (MCC)

Updated 11 March 2026

MCC is a metric that computes the correlation between predicted and actual classes using all elements of the confusion matrix, ensuring balanced evaluation even in imbalanced datasets.
It offers symmetric treatment of classes by equally incorporating true positives, true negatives, false positives, and false negatives, thereby avoiding misleading results seen with accuracy or F1 score.
MCC is versatile, extending to multiclass and weighted settings, and is widely utilized in domains like medical imaging and cyberattack detection for reliable model benchmarking.

The Matthews Correlation Coefficient (MCC) is a scalar metric that quantifies the correlation between predicted and true labels in binary and multiclass classification tasks. Unlike accuracy, which can be misleading in imbalanced scenarios, MCC is considered a balanced measure because it incorporates all four cells of the confusion matrix—true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN)—with symmetric treatment of both classes. MCC has gained broad adoption as a primary evaluation measure, especially in domains where class imbalance is pervasive and the correct discrimination of both minority and majority classes is critical.

1. Formal Definition and Algebraic Properties

For the binary classification case, the Matthews Correlation Coefficient is given by

$\mathrm{MCC} = \frac{\mathrm{TP}\cdot\mathrm{TN} - \mathrm{FP}\cdot\mathrm{FN}}{\sqrt{(\mathrm{TP}+\mathrm{FP})(\mathrm{TP}+\mathrm{FN})(\mathrm{TN}+\mathrm{FP})(\mathrm{TN}+\mathrm{FN})}}$

This closed-form expression is the normalized covariance (Pearson correlation) between ground-truth and predicted label indicator vectors (Davari et al., 2021, Yao et al., 2020, Abhishek et al., 2020, Ferrer, 2022, Thiyagarajan et al., 22 Dec 2025). MCC takes values in $[-1, 1]$, with $+1$ denoting perfect prediction, $0$ indicating performance no better than chance, and $-1$ corresponding to total systematic disagreement.

The numerator rewards agreement on both classes (TP × TN) and penalizes the two error types (FP × FN). The denominator, the square root of the product of the four marginal sums, normalizes for differing marginal counts, which makes the metric symmetric under class relabeling and robust to class imbalance (Thiyagarajan et al., 22 Dec 2025).

If any of the marginal sums $(\mathrm{TP}+\mathrm{FP})$, $(\mathrm{TP}+\mathrm{FN})$, $(\mathrm{TN}+\mathrm{FP})$, or $(\mathrm{TN}+\mathrm{FN})$ is zero, the denominator vanishes and MCC is undefined (Yao et al., 2020), reflecting the lack of sufficient class variability for meaningful correlation.
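The definition can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `binary_mcc` and the choice to return NaN in the undefined case are assumptions made here, and the counts are made up.

```python
import math

def binary_mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """MCC from the four confusion-matrix counts; NaN when undefined."""
    numerator = tp * tn - fp * fn
    # Product of the four marginal sums under the square root.
    denominator_sq = (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    if denominator_sq == 0:
        # Assumption: signal the undefined case with NaN rather than a fallback value.
        return float("nan")
    return numerator / math.sqrt(denominator_sq)

print(binary_mcc(tp=90, tn=80, fp=20, fn=10))  # a reasonably good classifier
print(binary_mcc(tp=50, tn=50, fp=0, fn=0))    # perfect prediction
print(binary_mcc(tp=0, tn=0, fp=50, fn=50))    # every prediction inverted
print(binary_mcc(tp=0, tn=95, fp=0, fn=5))     # no positive predictions: undefined
```

Returning NaN keeps the undefined case visible to the caller; some implementations instead return 0 or add a small constant to the denominator.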

2. Interpretation, Symmetry, and Robustness to Imbalance

MCC is equivalent to the Pearson product-moment correlation coefficient for binary variables (Abhishek et al., 2020, Itaya et al., 2024). Key interpretive features include:

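Symmetric class treatment: Relabeling the classes (exchanging which one is called positive in both the true and predicted labels) swaps TP with TN and FP with FN and leaves MCC unchanged; inverting only the predictions flips its sign but not its magnitude. The metric weighs performance on both classes equally (Yao et al., 2020, Davari et al., 2021, Abhishek et al., 2020).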
Balance under extreme class skew: Because MCC combines TP, TN, FP, and FN, it remains interpretable whether a class is rare or common. Accuracy is high whenever the majority class is always predicted, whereas for such a constant predictor the MCC numerator is zero (the ratio is formally undefined and conventionally reported as 0), preventing artificial inflation under imbalance (Davari et al., 2021, Thiyagarajan et al., 22 Dec 2025).
Numerical edge cases: Provided all four marginal sums are nonzero, MCC is $+1$ if both FP and FN are zero (perfect prediction), $0$ if the numerator is zero (covariance vanishes), and $-1$ if both TP and TN are zero (total prediction inversion) (Thiyagarajan et al., 22 Dec 2025, Itaya et al., 2024).

MCC penalizes trivial baseline classifiers that always predict the majority class, producing near-zero scores where accuracy would approach 1.0. Simulations and evaluations on real imbalanced datasets—including cyberattack detection and pixelwise segmentation—demonstrate that MCC reliably differentiates effective from ineffective models in contexts where F1 or accuracy are insensitive (Thiyagarajan et al., 22 Dec 2025, Davari et al., 2021).
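A small made-up example illustrates the contrast with accuracy and the behavior under relabeling. The data (95 negatives, 5 positives), the helper names, and the convention of returning 0.0 when a marginal sum is empty are all assumptions chosen for this sketch.

```python
import math

def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, TN, FP, FN), treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, tn, fp, fn

def mcc(tp, tn, fp, fn):
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumption: report 0.0 when a marginal sum is empty.
    return (tp * tn - fp * fn) / den if den else 0.0

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

y_true = [0] * 95 + [1] * 5
always_negative = [0] * 100
# Hypothetical detector: 3 false alarms, 4 of the 5 positives found.
detector = [0] * 92 + [1] * 3 + [1] * 4 + [0] * 1

for name, y_pred in [("always negative", always_negative), ("detector", detector)]:
    counts = confusion_counts(y_true, y_pred)
    print(f"{name:>15}: accuracy={accuracy(*counts):.2f}  MCC={mcc(*counts):.3f}")

# Calling class 0 "positive" swaps TP<->TN and FP<->FN: MCC is unchanged.
print(mcc(*confusion_counts(y_true, detector, positive=0)))
# Inverting only the predictions flips the sign.
print(mcc(*confusion_counts(y_true, [1 - p for p in detector])))
```

The two classifiers differ by only one point of accuracy (0.95 versus 0.96), while MCC separates them clearly.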

3. Comparison with Other Binary Metrics

Metric	Incorporates TN?	Symmetry	Range	Imbalance Robustness
Accuracy	Yes	Yes	[0,1]	No
F1 / Dice	No	No	[0,1]	No
Precision, Recall	No	No	[0,1]	No
MCC	Yes	Yes	[-1,1]	Yes

F1 score and Dice coefficient depend solely on TP, FP, FN and disregard TN entirely, making them potentially misleading in datasets where true negatives dominate (Yao et al., 2020, Abhishek et al., 2020). The Fowlkes–Mallows (FM) score, the geometric mean of precision and recall, similarly omits TN. MCC, by contrast, is invariant to label permutations and robust to imbalance because it multiplies all four marginal sums in the denominator (Crall, 2023, Abhishek et al., 2020).

In highly imbalanced settings, confusion-matrix metrics that neglect TN, such as F1 or FM, become less reliable. Empirical studies show that F1 can produce the same value in fundamentally different confusion matrices, while MCC distinguishes these scenarios through the inclusion of TN (Yao et al., 2020). In the limit of infinite TN (e.g., object detection with huge negative space), MCC converges to FM, justifying the use of FM or precision-recall curves as practical surrogates when enumeration of TN is infeasible (Crall, 2023).
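Both effects can be checked numerically. Dividing the numerator and denominator of MCC by TN shows that, with TP, FP, and FN held fixed, MCC tends to $\mathrm{TP}/\sqrt{(\mathrm{TP}+\mathrm{FP})(\mathrm{TP}+\mathrm{FN})}$, which is the FM score. The sketch below uses hypothetical counts and illustrative function names.

```python
import math

def f1(tp, fp, fn):
    return 2 * tp / (2 * tp + fp + fn)

def fowlkes_mallows(tp, fp, fn):
    # Geometric mean of precision and recall.
    return tp / math.sqrt((tp + fp) * (tp + fn))

def mcc(tp, tn, fp, fn):
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / den

tp, fp, fn = 40, 10, 20  # hypothetical counts, held fixed while TN varies
for tn in (30, 1_000, 1_000_000):
    print(f"TN={tn:>9}  F1={f1(tp, fp, fn):.4f}  "
          f"FM={fowlkes_mallows(tp, fp, fn):.4f}  MCC={mcc(tp, tn, fp, fn):.4f}")
```

F1 and FM stay constant across the three rows because neither depends on TN, while MCC rises toward the FM value as TN grows.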

4. Generalization to Multiclass and Weighted Settings

Multiclass extensions of MCC are formulated as multivariate Pearson correlations or through determinants of normalized confusion matrices, aiming to preserve balanced, symmetric evaluation across $K > 2$ categories (Jurman et al., 2010, Stoica et al., 2023, Itai et al., 2022).

Principal forms include:

Multiclass correlation (Gorodkin’s $R_K$): Generalizes the covariance formula via binary indicator matrices over the $K \times K$ confusion matrix (Stoica et al., 2023, Jurman et al., 2010).
Macro- and micro-averaged MCCs: Compute per-class (one-vs-rest) MCC values and average (macro), or aggregate all counts prior to a single MCC calculation (micro) (Tamura et al., 9 Mar 2025).
Geometric determinant-based MCC: Defines the generalized MCC as the determinant of a normalized confusion matrix, interpreted as the hypervolume of an $n$-dimensional parallelepiped in class space (Itai et al., 2022).
Enhanced metrics (EMPC, ER$_K$, EMCC): Address shortcomings of previous extensions, ensuring a conclusively negative value ($-1$) for completely misclassified “hollow” matrices (Stoica et al., 2023).
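The first two forms can be sketched directly from a $K \times K$ confusion matrix. The code below uses the commonly stated count-based expression for $R_K$ and a one-vs-rest macro average. The function names, the row/column orientation (rows are true classes, columns are predicted classes), and the 0.0 fallback for a vanishing denominator are assumptions of this sketch, and the determinant-based and enhanced variants are not covered.

```python
import math

def multiclass_mcc(cm):
    """R_K from a K x K confusion matrix (rows: true class, columns: predicted)."""
    k = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(k))
    true_counts = [sum(cm[i]) for i in range(k)]
    pred_counts = [sum(cm[i][j] for i in range(k)) for j in range(k)]
    numerator = correct * total - sum(t * p for t, p in zip(true_counts, pred_counts))
    den_sq = ((total**2 - sum(p * p for p in pred_counts))
              * (total**2 - sum(t * t for t in true_counts)))
    return numerator / math.sqrt(den_sq) if den_sq else 0.0

def macro_mcc(cm):
    """Unweighted mean of the one-vs-rest binary MCC values."""
    k = len(cm)
    total = sum(sum(row) for row in cm)
    scores = []
    for c in range(k):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp
        fp = sum(cm[i][c] for i in range(k)) - tp
        tn = total - tp - fn - fp
        den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
        scores.append((tp * tn - fp * fn) / den if den else 0.0)
    return sum(scores) / k

perfect = [[5, 0, 0], [0, 5, 0], [0, 0, 5]]
hollow = [[0, 5, 0], [0, 0, 5], [5, 0, 0]]  # every instance misclassified
print(multiclass_mcc(perfect), macro_mcc(perfect))
print(multiclass_mcc(hollow), macro_mcc(hollow))
```

On the hollow matrix both functions return $-0.5$ rather than $-1$, which is the kind of behavior the enhanced metrics are designed to correct. For $K = 2$, `multiclass_mcc` reduces to the binary formula.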

Weighted versions, where each observation has a user-specified importance, adapt the classical MCC by replacing unweighted sums with inner products against a diagonal weight matrix. These are robust to perturbations in weights and yield greater sensitivity to high-importance items (Cortez et al., 23 Dec 2025).
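One simple reading of the weighted construction is to accumulate observation weights, instead of unit counts, into the four confusion-matrix cells. The sketch below follows that reading. It is an assumption made for illustration and not necessarily the exact formulation of Cortez et al.; the function name, labels, and weights are hypothetical.

```python
import math

def weighted_mcc(y_true, y_pred, weights):
    """Binary MCC (labels 0/1) with a nonnegative weight per observation."""
    tp = tn = fp = fn = 0.0
    for t, p, w in zip(y_true, y_pred, weights):
        if t == 1 and p == 1:
            tp += w
        elif t == 0 and p == 0:
            tn += w
        elif t == 0 and p == 1:
            fp += w
        else:
            fn += w
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / den if den else float("nan")

y_true = [1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1]
uniform = [1.0] * 6
# Hypothetical importance weights: the missed positive (index 1) matters most.
important = [1.0, 5.0, 1.0, 1.0, 1.0, 1.0]

print(weighted_mcc(y_true, y_pred, uniform))    # equals the unweighted MCC
print(weighted_mcc(y_true, y_pred, important))  # lower: the costly miss dominates
```

With uniform weights the result coincides with the classical MCC, and multiplying all weights by a common factor leaves the value unchanged.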

5. Role in Machine Learning Workflows and Practical Case Studies

MCC is widely used as a model selection criterion, especially in domains characterized by severe label imbalance or in high-stakes applications such as medical image segmentation and cyberattack detection (Davari et al., 2021, Abhishek et al., 2020, Thiyagarajan et al., 22 Dec 2025). Key operational roles include:

Early stopping criterion in deep learning: Monitoring validation MCC in segmentation pipelines yields more reliable stopping points compared to cross-entropy loss, resulting in substantial improvements in overlap-based metrics (e.g., Dice) and reduced false positives (Davari et al., 2021).
As a loss function: Differentiable relaxations of MCC enable direct optimization in neural networks, outperforming Dice-based losses in image segmentation and producing superior specificity and sensitivity balance (Abhishek et al., 2020).
Model ranking and benchmarking: MCC consistently discriminates among classifiers with equivalent accuracy but differing ability to recognize minority-class examples, central in software defect prediction, intrusion detection, and other class-skewed challenges (Yao et al., 2020, Thiyagarajan et al., 22 Dec 2025).
Threshold optimization: Composite plots such as the MCC-F1 curve permit threshold selection and multi-threshold integration for binary classifiers, further extending use in practical performance analysis (Cao et al., 2020).
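As a simple illustration of threshold optimization, the sketch below sweeps candidate thresholds over classifier scores and keeps the one with the highest MCC. It is a plain MCC-maximizing sweep, not the MCC-F1 curve of Cao et al.; the function names, the 0.0 fallback for degenerate thresholds, and the data are assumptions.

```python
import math

def mcc_at_threshold(y_true, scores, threshold):
    """MCC when every score >= threshold is predicted positive."""
    tp = tn = fp = fn = 0
    for t, s in zip(y_true, scores):
        pred = 1 if s >= threshold else 0
        if t == 1 and pred == 1:
            tp += 1
        elif t == 0 and pred == 0:
            tn += 1
        elif t == 0 and pred == 1:
            fp += 1
        else:
            fn += 1
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumption: thresholds that yield a constant prediction score 0.0.
    return (tp * tn - fp * fn) / den if den else 0.0

def best_mcc_threshold(y_true, scores):
    """Return (best MCC, threshold), using the observed scores as candidates."""
    candidates = sorted(set(scores))
    return max((mcc_at_threshold(y_true, scores, th), th) for th in candidates)

# Made-up validation labels and scores.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.2, 0.3, 0.35, 0.4, 0.55, 0.8, 0.9]
best_score, best_threshold = best_mcc_threshold(y_true, scores)
print(f"best threshold={best_threshold}  MCC={best_score:.4f}")
```

In practice the threshold would be chosen on held-out validation data, since a maximum taken over the same data used for reporting is optimistically biased.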

6. Statistical Inference and Confidence Intervals

The statistical estimation of MCC admits a rigorous asymptotic theory. Under i.i.d. sampling, the confusion-matrix counts follow a multinomial distribution, and MCC, as a smooth function of those counts, is asymptotically normal by the delta method. Closed-form expressions for the variance are available, and Fisher’s $z$-transformation improves the coverage and symmetry of confidence intervals, especially when MCC is near $\pm 1$ or data are imbalanced (Itaya et al., 2024, Tamura et al., 9 Mar 2025).
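The Fisher-transformed interval can be sketched as follows. The closed-form variance is not reproduced here: the sketch assumes the caller supplies `sigma_hat`, an estimate of the asymptotic standard deviation of $\sqrt{n}\,(\widehat{\mathrm{MCC}} - \mathrm{MCC})$ obtained elsewhere, for example from a delta-method formula or a bootstrap. The function name and the numeric inputs are placeholders.

```python
import math

def fisher_z_interval(mcc_hat, sigma_hat, n, z_crit=1.959964):
    """Approximate confidence interval for MCC on the Fisher z scale.

    Requires -1 < mcc_hat < 1. The default z_crit corresponds to a
    two-sided 95% interval under the normal approximation.
    """
    z = math.atanh(mcc_hat)
    # Delta method for atanh: d/dx atanh(x) = 1 / (1 - x^2).
    se_z = sigma_hat / (math.sqrt(n) * (1.0 - mcc_hat**2))
    # Map the symmetric interval on the z scale back to (-1, 1).
    return math.tanh(z - z_crit * se_z), math.tanh(z + z_crit * se_z)

# Placeholder inputs: sigma_hat = 0.5 is not derived from any data.
low, high = fisher_z_interval(mcc_hat=0.9, sigma_hat=0.5, n=200)
print(f"95% interval: ({low:.3f}, {high:.3f})")
```

Because the interval is built on the $z$ scale and mapped back through $\tanh$, its endpoints always lie inside $(-1, 1)$, and for estimates near $\pm 1$ the interval is asymmetric around the point estimate.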

In paired designs (two classifiers applied to the same instances), delta-method-based asymptotic intervals, Fisher-transformed bounds, and modified-log-ratio techniques permit direct inference on MCC differences, accounting for covariance between models’ predictions (Itaya et al., 2024, Tamura et al., 9 Mar 2025). These techniques ensure reliable comparison and address the unavailability of exact finite-sample distributions.

7. Limitations, Alternatives, and Recommendations

Despite its strengths, MCC is subject to several limitations:

Undefined in degenerate confusion matrices: If any marginal is zero, MCC cannot be computed. Practical implementations often add a small $\epsilon$ to stabilize computations (Yao et al., 2020, Abhishek et al., 2020).
Sensitivity to decision imbalance: For identical balanced error, MCC can prefer classifiers with more imbalanced prediction ratios, due to the structure of the normalization term (Ferrer, 2022).
Nontrivial extension to arbitrary cost or abstain frameworks: MCC is less transparent or adaptable than explicit expected-cost metrics for settings with nonuniform error costs or abstention (Ferrer, 2022).
Interpretation under massive imbalance: While MCC approaches the FM score as TN grows, interpretation in settings with nearly infinite negatives requires care (Crall, 2023).

Recommended practices include always reporting the full confusion matrix, employing MCC alongside or in place of accuracy, explicitly favoring macro-averaged MCC in multiclass and imbalanced cases, and leveraging robust inference approaches for statistically valid conclusions (Thiyagarajan et al., 22 Dec 2025, Davari et al., 2021, Tamura et al., 9 Mar 2025, Itaya et al., 2024). For datasets or applications where observation weights are crucial, weighted MCC variants and their multiclass analogues provide direct sensitivity to importance while maintaining robustness (Cortez et al., 23 Dec 2025).