Estimating Unknown Population Sizes Using the Hypergeometric Distribution

Published 22 Feb 2024 in cs.LG, stat.ME, and stat.ML | (2402.14220v2)

Abstract: The multivariate hypergeometric distribution describes sampling without replacement from a discrete population of elements divided into multiple categories. Addressing a gap in the literature, we tackle the challenge of estimating discrete distributions when both the total population size and the sizes of its constituent categories are unknown. Here, we propose a novel solution using the hypergeometric likelihood to solve this estimation challenge, even in the presence of severe under-sampling. We develop our approach to account for a data generating process where the ground-truth is a mixture of distributions conditional on a continuous latent variable, such as with collaborative filtering, using the variational autoencoder framework. Empirical data simulation demonstrates that our method outperforms other likelihood functions used to model count data, both in terms of accuracy of population size estimate and in its ability to learn an informative latent space. We demonstrate our method's versatility through applications in NLP, by inferring and estimating the complexity of latent vocabularies in text excerpts, and in biology, by accurately recovering the true number of gene transcripts from sparse single-cell genomics data.
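As a minimal illustration of the likelihood the abstract refers to, the sketch below evaluates the standard multivariate hypergeometric log-probability for a fixed set of category sizes. It is not the paper's estimator: the function names `log_binom` and `mvhg_log_pmf` are hypothetical, and the variational autoencoder component described in the abstract is not modelled here.

```python
from math import lgamma, exp


def log_binom(n, k):
    """Log of the binomial coefficient C(n, k), computed via log-gamma."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def mvhg_log_pmf(category_sizes, draws):
    """Log-probability of observing `draws` when sampling without
    replacement from a population with the given `category_sizes`.

    log P = sum_i log C(N_i, k_i) - log C(N, n),
    where N = sum_i N_i and n = sum_i k_i.
    """
    if len(category_sizes) != len(draws):
        raise ValueError("category_sizes and draws must have equal length")
    # An outcome that draws more items than a category holds is impossible.
    if any(k < 0 or k > size for size, k in zip(category_sizes, draws)):
        return float("-inf")
    total_size = sum(category_sizes)
    total_draws = sum(draws)
    numerator = sum(log_binom(size, k) for size, k in zip(category_sizes, draws))
    return numerator - log_binom(total_size, total_draws)


# Example: 5 items of one category and 3 of another; draw 2 and 1.
# P = C(5, 2) * C(3, 1) / C(8, 3) = 30 / 56
probability = exp(mvhg_log_pmf([5, 3], [2, 1]))
```

In the estimation setting the abstract describes, `category_sizes` would be the unknown quantity and `draws` the observed counts, so a function of this form would serve as the objective to maximise over candidate sizes.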

Authors (2)