Structure of Classifier Boundaries: Case Study for a Naive Bayes Classifier (2212.04382v2)
Abstract: Whether based on models, training data, or a combination, classifiers place (possibly complex) input data into one of a relatively small number of output categories. In this paper, we study the structure of the boundary (those points for which a neighbor is classified differently) in the context of an input space that is a graph, so that there is a concept of neighboring inputs. The scientific setting is a model-based naive Bayes classifier for DNA reads produced by Next Generation Sequencers. We show that the boundary is both large and complicated in structure. We create a new measure of uncertainty, called Neighbor Similarity, that compares the result for a point to the distribution of results for its neighbors. This measure not only tracks two inherent uncertainty measures for the Bayes classifier, but can also be implemented, at a computational cost, for classifiers without inherent measures of uncertainty.
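The boundary definition above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: neighbors are taken to be reads that differ by a single base substitution, `toy_classifier` is a hypothetical GC-content rule standing in for the naive Bayes classifier, and `neighbor_agreement` is only a crude proxy (the fraction of neighbors that share the point's label) for the idea of comparing a point's result to the distribution of its neighbors' results. It is not the paper's Neighbor Similarity formula.

```python
from collections import Counter

ALPHABET = "ACGT"


def hamming_neighbors(read):
    """Yield every sequence that differs from `read` by one substitution.

    Assumption: the graph on reads joins sequences at Hamming distance 1.
    """
    for i, base in enumerate(read):
        for alt in ALPHABET:
            if alt != base:
                yield read[:i] + alt + read[i + 1:]


def toy_classifier(read):
    """Hypothetical stand-in classifier: label a read by its GC fraction."""
    gc = sum(base in "GC" for base in read)
    return "high_gc" if 2 * gc > len(read) else "low_gc"


def is_boundary(read, classify=toy_classifier):
    """True if at least one neighbor receives a different label."""
    label = classify(read)
    return any(classify(n) != label for n in hamming_neighbors(read))


def neighbor_agreement(read, classify=toy_classifier):
    """Fraction of neighbors whose label matches the label of `read`.

    Illustrative proxy only; not the paper's Neighbor Similarity.
    """
    label = classify(read)
    counts = Counter(classify(n) for n in hamming_neighbors(read))
    return counts[label] / sum(counts.values())


if __name__ == "__main__":
    for read in ("ACGT", "AAAA"):
        print(read, is_boundary(read), neighbor_agreement(read))
```

Because the sketch only calls the classifier on a point and on its neighbors, the same loop works for any classifier that returns a label, at the cost of one classification per neighbor. This matches the abstract's remark that such a measure can be computed, at a computational cost, for classifiers without inherent uncertainty measures.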