
Supervised Contamination Detection, with Flow Cytometry Application (2404.06093v1)

Published 9 Apr 2024 in stat.ME

Abstract: The contamination detection problem aims to determine whether a set of observations has been contaminated, i.e. whether it contains points drawn from a distribution different from the reference distribution. Here, we consider a supervised problem, where labeled samples drawn from both the reference distribution and the contamination distribution are available at training time. This problem is motivated by the detection of rare cells in flow cytometry. Compared to novelty detection problems or two-sample testing, where only samples from the reference distribution are available, the challenge lies in efficiently leveraging the observations from the contamination distribution to design more powerful tests. In this article, we introduce a test for the supervised contamination detection problem. We provide non-asymptotic guarantees on its Type I error, and characterize its detection rate. The test relies on estimating reference and contamination densities using histograms, and its power depends strongly on the choice of the corresponding partition. We present an algorithm for judiciously choosing the partition that results in a powerful test. Simulations illustrate the good empirical performance of our partition selection algorithm and the efficiency of our test. Finally, we showcase our method by applying it to a real flow cytometry dataset.
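
To make the histogram idea in the abstract concrete, here is a minimal Python sketch of a histogram-based contamination test in one dimension. It is an illustration under stated assumptions, not the authors' test: the statistic (an average plug-in log-likelihood ratio over bins), the Monte Carlo calibration, the fixed equal-width partition, and the names `bin_probabilities` and `contamination_test` are all hypothetical choices made for this example. In particular, it does not implement the paper's partition selection algorithm and carries no non-asymptotic Type I error guarantee.

```python
import numpy as np


def bin_probabilities(x, edges):
    """Smoothed histogram estimate of the bin probabilities of x."""
    # Clip so that points outside the partition fall into the outer bins.
    clipped = np.clip(x, edges[0], edges[-1])
    counts, _ = np.histogram(clipped, bins=edges)
    # Add-one smoothing keeps every bin probability strictly positive.
    return (counts + 1.0) / (counts.sum() + len(counts))


def contamination_test(ref_train, cont_train, sample, edges, n_mc=2000, seed=0):
    """Return (statistic, p-value) for H0: sample comes from the reference.

    Assumption for this sketch: the statistic is the mean over the sample of
    log(p1/p0), where p0 and p1 are histogram estimates of the reference and
    contamination bin probabilities. The null distribution is approximated by
    simulating bin counts from the estimated reference histogram.
    """
    rng = np.random.default_rng(seed)
    p0 = bin_probabilities(ref_train, edges)
    p1 = bin_probabilities(cont_train, edges)
    llr = np.log(p1 / p0)

    n = len(sample)
    counts, _ = np.histogram(np.clip(sample, edges[0], edges[-1]), bins=edges)
    stat = counts @ llr / n

    # Plug-in Monte Carlo calibration under the estimated reference histogram.
    null_counts = rng.multinomial(n, p0, size=n_mc)
    null_stats = null_counts @ llr / n
    p_value = (1 + np.sum(null_stats >= stat)) / (1 + n_mc)
    return stat, p_value


# Synthetic data (assumed for illustration): reference N(0, 1), contamination N(3, 1).
data_rng = np.random.default_rng(1)
edges = np.linspace(-6.0, 9.0, 16)  # 15 equal-width bins, chosen by hand
ref_train = data_rng.normal(0.0, 1.0, 5000)
cont_train = data_rng.normal(3.0, 1.0, 5000)

clean = data_rng.normal(0.0, 1.0, 500)
mixed = np.concatenate(
    [data_rng.normal(0.0, 1.0, 350), data_rng.normal(3.0, 1.0, 150)]
)

stat_clean, p_clean = contamination_test(ref_train, cont_train, clean, edges)
stat_mixed, p_mixed = contamination_test(ref_train, cont_train, mixed, edges)
print("clean sample:", stat_clean, p_clean)
print("mixed sample:", stat_mixed, p_mixed)
```

The hand-picked equal-width bins are the weakest part of this sketch. As the abstract notes, the power of such a test depends strongly on the partition, which is why the paper proposes a dedicated selection algorithm.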
