Estimating Unknown Population Sizes Using the Hypergeometric Distribution
Abstract: The multivariate hypergeometric distribution describes sampling without replacement from a discrete population of elements divided into multiple categories. Addressing a gap in the literature, we tackle the challenge of estimating discrete distributions when both the total population size and the sizes of its constituent categories are unknown. We propose a novel solution that uses the hypergeometric likelihood to solve this estimation problem, even in the presence of severe under-sampling. Using the variational autoencoder framework, we extend our approach to account for a data-generating process in which the ground truth is a mixture of distributions conditional on a continuous latent variable, as in collaborative filtering. Empirical data simulation demonstrates that our method outperforms other likelihood functions used to model count data, both in the accuracy of the population size estimate and in its ability to learn an informative latent space. We demonstrate our method's versatility through applications in NLP, by inferring and estimating the complexity of latent vocabularies in text excerpts, and in biology, by accurately recovering the true number of gene transcripts from sparse single-cell genomics data.
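As a point of reference for the likelihood named in the abstract, the following is a minimal sketch of the standard multivariate hypergeometric log-probability: the log of the product of C(K_i, k_i) over categories divided by C(N, n), where K_i are the category sizes, k_i the observed counts, N the sum of the K_i and n the sum of the k_i. It is not the paper's implementation. The function names `log_binom` and `mv_hypergeom_logpmf` are hypothetical, and the use of the log-gamma function (which also lets the sizes take real values, so the expression can be optimized with gradient methods) is an assumption made here for illustration.

```python
from math import lgamma, exp


def log_binom(n, k):
    """Log of the binomial coefficient C(n, k), written with log-gamma.

    Using lgamma instead of factorials keeps the expression defined
    for real-valued n, which is an illustrative choice in this sketch.
    """
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def mv_hypergeom_logpmf(counts, sizes):
    """Log-probability of drawing `counts` without replacement
    from a population whose categories have the given `sizes`.

    counts: observed count k_i per category
    sizes:  category size K_i per category (total N = sum of sizes)
    """
    if len(counts) != len(sizes):
        raise ValueError("counts and sizes must have the same length")
    # A draw is impossible if any count is negative or exceeds its category size.
    if any(k < 0 or k > K for k, K in zip(counts, sizes)):
        return float("-inf")
    n = sum(counts)  # sample size
    N = sum(sizes)   # total population size
    numerator = sum(log_binom(K, k) for k, K in zip(counts, sizes))
    return numerator - log_binom(N, n)


if __name__ == "__main__":
    # One item from each of two categories of size 5: C(5,1)*C(5,1)/C(10,2) = 25/45.
    print(exp(mv_hypergeom_logpmf([1, 1], [5, 5])))
```

In the setting the abstract describes, the sizes passed as `sizes` are the unknown quantities to be estimated, so a function of this form would serve as the objective rather than as a way to evaluate known probabilities.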