Limits to classification performance by relating Kullback-Leibler divergence to Cohen's Kappa (2403.01571v1)
Abstract: The performance of machine learning classification algorithms is evaluated by estimating metrics, often from the confusion matrix, using training data and cross-validation. However, these metrics do not show whether the best possible performance has been achieved. Fundamental limits to error rates can be estimated using information distance measures. To this end, the confusion matrix has been formulated to comply with the Chernoff-Stein Lemma. This links the error rates to the Kullback-Leibler divergences between the probability density functions describing the two classes. This leads to a key result that relates Cohen's Kappa to the Resistor Average Distance, which is the parallel-resistor combination of the two Kullback-Leibler divergences. The Resistor Average Distance has units of bits and is estimated from the same training data used by the classification algorithm, using kNN estimates of the Kullback-Leibler divergences. The classification algorithm gives the confusion matrix and Kappa. Theory and methods are discussed in detail and then applied to Monte Carlo data and real datasets. Four very different real datasets (Breast Cancer, Coronary Heart Disease, Bankruptcy, and Particle Identification), with both continuous and discrete variables, are analysed and their classification performance compared to the expected theoretical limit. In all cases this analysis shows that the algorithms could not have performed any better, given the underlying probability density functions of the two classes. Important lessons are learnt on how to predict the performance of algorithms for imbalanced data using training datasets that are approximately balanced. Machine learning is very powerful, but classification performance ultimately depends on the quality of the data and the relevance of the variables to the problem.
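To make the estimation pipeline concrete, the sketch below shows one common way to obtain kNN estimates of the two directed Kullback-Leibler divergences from training samples of the two classes and to combine them into the Resistor Average Distance, i.e. their parallel-resistor combination in bits. The nearest-neighbour estimator used here is the standard Wang-Kulkarni-Verdú form; the function names, the SciPy-based implementation, and the default choice of k are illustrative assumptions, not the authors' exact code.

```python
import numpy as np
from scipy.spatial import cKDTree

def kl_divergence_knn(x, y, k=1):
    """kNN estimate of D_KL(p || q) in bits, from samples x ~ p and y ~ q.

    x: (n, d) array of samples from one class; y: (m, d) array from the other.
    Standard nearest-neighbour estimator (Wang-Kulkarni-Verdu form); the
    paper's exact estimator and choice of k may differ.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n, d = x.shape
    m = y.shape[0]
    # rho: distance from each x_i to its k-th nearest neighbour within x (self excluded)
    rho = cKDTree(x).query(x, k=k + 1)[0][:, -1]
    # nu: distance from each x_i to its k-th nearest neighbour within y
    nu = cKDTree(y).query(x, k=k)[0]
    if k > 1:
        nu = nu[:, -1]
    # Estimate in nats, then convert to bits
    return (d * np.mean(np.log(nu / rho)) + np.log(m / (n - 1))) / np.log(2)

def resistor_average_distance(d_pq, d_qp):
    """Parallel-resistor combination of the two directed KL divergences (bits)."""
    return d_pq * d_qp / (d_pq + d_qp)
```

A Resistor Average Distance estimated this way from the training data can then be compared with the Kappa obtained from the classifier's confusion matrix, which is the comparison the paper carries out for the Monte Carlo and real datasets.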
- E. Frank, M. Hall and I. Witten (2016), “The WEKA Workbench”, Morgan Kaufmann, Fourth Edition.
- A. Webb and K. Copsey (2011), “Statistical Pattern Recognition”, John Wiley and Sons, Third Edition.
- L. Wasserman (2004), “All of Statistics”, Springer Texts in Statistics.
- S. Kullback (1997), “Information Theory and Statistics”, Dover Publications Inc.
- H. Chernoff, “Large-Sample Theory: Parametric Case” (1956) Ann. Math. Stat. 27(1), 1-22.
- H. Chernoff, “A measure of asymptotic efficiency for tests of a hypothesis based on a sum of observations” (1952) Ann. Math. Stat. 23(4), 493-507.
- S. Sinanovic and D. Johnson, “Toward a theory of information processing”, Signal Processing (2007) 87(6), 1326-1344.
- A. Bhattacharyya, “On a Measure of Divergence between Two Multinomial Populations” (1946) Sankhyā: The Indian Journal of Statistics, 7(4), 401-406.
- L. Nieto and A. Correndo, “Classification performance metrics and indices” (2023), R package documentation, https://cran.r-project.org/web/packages/metrica/vignettes/available_metrics_classification.html, Accessed 4 December 2023.
- J. Cohen, “A coefficient of agreement for nominal scales” (1960) Educational and Psychological Measurement, XX(1), 37-46.
- D. McAllester and K. Stratos, “Formal limitations on the measurement of mutual information” (2020), Proceedings of the 23rd AISTATS Conference, Palermo, Vol. 108.
- T. van Erven and P. Harremoës, “Rényi Divergence and Kullback-Leibler Divergence” (2014) IEEE Transactions on Information Theory, 60(7), 3797-3820.
- G. E. Crooks, “On measures of entropy and information” (2021) Tech. Note 009 v0.8, http://threeplusone.com/info, Accessed 4 December 2023.
- L. Kozachenko and N. Leonenko, “Sample Estimate of the Entropy of a Random Vector” (1987) Problems Inform. Transmission, 23(2), 95-101.
- Lisa Crow, “A novel approach to estimating information-theoretic measures for exploratory data analysis and explainable machine learning” (2022) PhD Thesis, University of Manchester, https://www.escholar.manchester.ac.uk/uk-ac-man-scw:332338.
- S J. Watts & L. Crow, “ Big Variates: Visualizing and identifying key variables in a multivariate world” (2019) Nuclear Instruments and Methods in Physics Research A: 940, 441-447.
- S. J. Watts and L. Crow, “ The Shannon Entropy of a Histogram “ (2022) https://doi.org/10.48550/arXiv.2210.02848
- W. N. Street, W. H. Wolberg and O. L. Mangasarian. “ Nuclear feature extraction for breast tumor diagnosis “ (1993) IS&T/SPIE International Symposium on Electronic Imaging: Science and Technology, volume 1905, pages 861-870, San Jose, CA.
- Kim and Han, “ The Discovery of Experts’ Decision Rules from Quantitative Data Using Genetic Algorithms “ (2003) Expert Systems with Applications, 25, 637-646.
- R. Bellman, “Adaptive Control Processes” (1961) Princeton University Press.
- K. Beyer, J. Goldstein, R. Ramakrishnan and U. Shaft, “When is ‘nearest neighbor’ meaningful?” (1999) Proc. Int. Conf. Database Theory, 217-235.
- Data from https://www.cancerresearchuk.org/, accessed on 22 January 2024.
- A. Fernandez, S. Garcia, M. Galar, R. Prati, B. Krawczyk, F. Herrera, “Learning from Imbalanced Data Sets” (2018), Springer.
- D. G. Altman, “Practical Statistics for Medical Research” (1991) Chapman and Hall, London, p. 404.
- T. Hastie, R. Tibshirani and J. Friedman, “The Elements of Statistical Learning: Data Mining, Inference, and Prediction” (2009) Second Edition (Springer Series in Statistics)