Testing Data Binnings

Published 27 Apr 2020 in cs.DS and cs.DM | (2004.12893v1)

Abstract: Motivated by the question of data quantization and "binning," we revisit the problem of identity testing of discrete probability distributions. Identity testing (a.k.a. one-sample testing), a fundamental and by now well-understood problem in distribution testing, asks, given a reference distribution (model) $\mathbf{q}$ and samples from an unknown distribution $\mathbf{p}$, both over $[n]=\{1,2,\dots,n\}$, whether $\mathbf{p}$ equals $\mathbf{q}$, or is significantly different from it. In this paper, we introduce the related question of 'identity up to binning,' where the reference distribution $\mathbf{q}$ is over $k \ll n$ elements: the question is then whether there exists a suitable binning of the domain $[n]$ into $k$ intervals such that, once "binned," $\mathbf{p}$ is equal to $\mathbf{q}$. We provide nearly tight upper and lower bounds on the sample complexity of this new question, showing both a quantitative and qualitative difference with the vanilla identity testing one, and answering an open question of Canonne (2019). Finally, we discuss several extensions and related research directions.
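
To make the definition of "identity up to binning" concrete, here is a minimal Python sketch of the exact decision problem when $\mathbf{p}$ is fully known. This is not the paper's tester, which only sees samples from $\mathbf{p}$ and must tolerate statistical error. The function names `bin_distribution` and `find_binning` are hypothetical, and the sketch assumes every entry of $\mathbf{q}$ is strictly positive and that probabilities are given as exact `Fraction` values.

```python
from fractions import Fraction
from itertools import accumulate


def bin_distribution(p, cuts):
    """Sum p over consecutive intervals [start, end), one per entry of cuts."""
    binned, start = [], 0
    for end in cuts:
        binned.append(sum(p[start:end], Fraction(0)))
        start = end
    return binned


def find_binning(p, q):
    """Return right endpoints of k intervals partitioning range(len(p)) whose
    binned masses equal q, or None if no such binning exists.

    Assumes (for this sketch) that all entries of q are strictly positive.
    """
    n = len(p)
    prefix = list(accumulate(p))  # prefix[i] = p[0] + ... + p[i]
    cuts, i = [], 0               # i = number of elements consumed so far
    target = Fraction(0)
    for mass in q:
        target += mass
        current = prefix[i - 1] if i > 0 else Fraction(0)
        # Extend the current interval until its cumulative mass reaches target.
        while i < n and current < target:
            current = prefix[i]
            i += 1
        if current != target:     # overshot or ran out of elements
            return None
        cuts.append(i)
    if not cuts:
        return None
    cuts[-1] = n                  # trailing zero-mass elements join the last bin
    # Final check also rejects inputs whose total masses differ.
    return cuts if bin_distribution(p, cuts) == list(q) else None


F = Fraction
p = [F(1, 8), F(1, 8), F(1, 4), F(1, 4), F(1, 4)]   # distribution over n = 5
q_yes = [F(1, 4), F(1, 2), F(1, 4)]                 # reference over k = 3
q_no = [F(1, 3), F(2, 3)]                           # no interval binning matches

cuts_yes = find_binning(p, q_yes)
cuts_no = find_binning(p, q_no)
```

Because the masses are nonnegative, the prefix sums of $\mathbf{p}$ are nondecreasing, so a single left-to-right scan suffices: a valid binning exists exactly when every cumulative sum of $\mathbf{q}$ appears among the prefix sums of $\mathbf{p}$. In the example, `q_yes` is matched by the intervals $\{1,2\}$, $\{3,4\}$, $\{5\}$, while `q_no` is not matched by any binning.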
