
Structure of Classifier Boundaries: Case Study for a Naive Bayes Classifier (2212.04382v2)

Published 8 Dec 2022 in stat.ML and cs.LG

Abstract: Whether based on models, training data or a combination, classifiers place (possibly complex) input data into one of a relatively small number of output categories. In this paper, we study the structure of the boundary (those points for which a neighbor is classified differently) in the context of an input space that is a graph, so that there is a concept of neighboring inputs. The scientific setting is a model-based naive Bayes classifier for DNA reads produced by Next Generation Sequencers. We show that the boundary is both large and complicated in structure. We create a new measure of uncertainty, called Neighbor Similarity, that compares the result for a point to the distribution of results for its neighbors. This measure not only tracks two inherent uncertainty measures for the Bayes classifier, but also can be implemented, at a computational cost, for classifiers without inherent measures of uncertainty.
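
The boundary definition in the abstract can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the graph is assumed to join reads that differ by a single base substitution, `toy_classifier` is a hypothetical GC-content rule standing in for the naive Bayes classifier, and `neighbor_agreement` is a simple agreement fraction rather than the paper's Neighbor Similarity, whose exact definition the abstract does not give.

```python
from typing import Callable, Iterator

BASES = "ACGT"


def substitution_neighbors(read: str) -> Iterator[str]:
    """Yield every read that differs from `read` by one base substitution.

    Assumption: this is the neighbor relation; the abstract only says the
    input space is a graph.
    """
    for i, base in enumerate(read):
        for alt in BASES:
            if alt != base:
                yield read[:i] + alt + read[i + 1:]


def is_boundary_point(read: str, classify: Callable[[str], str]) -> bool:
    """Return True if at least one neighbor is classified differently."""
    label = classify(read)
    return any(classify(n) != label for n in substitution_neighbors(read))


def neighbor_agreement(read: str, classify: Callable[[str], str]) -> float:
    """Fraction of neighbors that receive the same label as `read`.

    Illustrative stand-in only, not the paper's Neighbor Similarity.
    """
    label = classify(read)
    neighbors = list(substitution_neighbors(read))
    if not neighbors:  # an empty read has no neighbors
        return 1.0
    same = sum(classify(n) == label for n in neighbors)
    return same / len(neighbors)


def toy_classifier(read: str) -> str:
    """Hypothetical classifier: label a read by its GC content."""
    gc = sum(b in "GC" for b in read)
    return "high_gc" if 2 * gc >= len(read) else "low_gc"


if __name__ == "__main__":
    for read in ("AAAA", "GCAA"):
        print(read,
              is_boundary_point(read, toy_classifier),
              round(neighbor_agreement(read, toy_classifier), 3))
```

Each call to `neighbor_agreement` classifies all 3L neighbors of a read of length L, which is one way to see why the abstract describes such a measure as coming "at a computational cost" for classifiers without an inherent uncertainty measure.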
