
Testing Support Size More Efficiently Than Learning Histograms (2410.18915v2)

Published 24 Oct 2024 in cs.DS and cs.LG

Abstract: Consider two problems about an unknown probability distribution $p$: 1. How many samples from $p$ are required to test if $p$ is supported on $n$ elements or not? Specifically, given samples from $p$, determine whether it is supported on at most $n$ elements, or it is "$\epsilon$-far" (in total variation distance) from being supported on $n$ elements. 2. Given $m$ samples from $p$, what is the largest lower bound on its support size that we can produce? The best known upper bound for problem (1) uses a general algorithm for learning the histogram of the distribution $p$, which requires $\Theta(\tfrac{n}{\epsilon^2 \log n})$ samples. We show that testing can be done more efficiently than learning the histogram, using only $O(\tfrac{n}{\epsilon \log n} \log(1/\epsilon))$ samples, nearly matching the best known lower bound of $\Omega(\tfrac{n}{\epsilon \log n})$. This algorithm also provides a better solution to problem (2), producing larger lower bounds on support size than what follows from previous work. The proof relies on an analysis of Chebyshev polynomial approximations outside the range where they are designed to be good approximations, and the paper is intended as an accessible self-contained exposition of the Chebyshev polynomial method.
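The abstract's key technical ingredient is the behavior of Chebyshev polynomials $T_k$: they stay bounded by $1$ on $[-1, 1]$ but grow rapidly outside that interval. A minimal sketch of this property (a general illustration of the standard three-term recurrence, not the paper's testing algorithm; the function name is our own):

```python
def chebyshev_T(k, x):
    """Evaluate the Chebyshev polynomial T_k(x) using the
    three-term recurrence T_{k+1}(x) = 2x*T_k(x) - T_{k-1}(x),
    with T_0(x) = 1 and T_1(x) = x."""
    if k == 0:
        return 1.0
    t_prev, t_cur = 1.0, float(x)
    for _ in range(k - 1):
        t_prev, t_cur = t_cur, 2 * x * t_cur - t_prev
    return t_cur


# Inside [-1, 1] the polynomial oscillates within [-1, 1]:
print(chebyshev_T(3, 0.5))   # T_3(x) = 4x^3 - 3x, so T_3(0.5) = -1.0

# Outside [-1, 1] it grows quickly, which is the regime the
# paper analyzes for its improved support-size tester:
print(chebyshev_T(3, 2.0))   # T_3(2) = 26.0
```

The boundedness inside $[-1,1]$ and fast growth outside it is exactly the trade-off the Chebyshev polynomial method exploits when building estimators for properties such as support size.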
