
Near-optimal algorithms for private estimation and sequential testing of collision probability (2504.13804v1)

Published 18 Apr 2025 in stat.ML, cs.AI, and cs.LG

Abstract: We present new algorithms for estimating and testing \emph{collision probability}, a fundamental measure of the spread of a discrete distribution that is widely used in many scientific fields. We describe an algorithm that satisfies $(\alpha, \beta)$-local differential privacy and estimates collision probability with error at most $\epsilon$ using $\tilde{O}\left(\frac{\log(1/\beta)}{\alpha^2 \epsilon^2}\right)$ samples for $\alpha \le 1$, which improves over previous work by a factor of $\frac{1}{\alpha^2}$. We also present a sequential testing algorithm for collision probability, which can distinguish between collision probability values that are separated by $\epsilon$ using $\tilde{O}(\frac{1}{\epsilon^2})$ samples, even when $\epsilon$ is unknown. Our algorithms have nearly the optimal sample complexity, and in experiments we show that they require significantly fewer samples than previous methods.

Summary

  • The paper introduces near-optimal algorithms for the private estimation and sequential testing of discrete distribution collision probability, leveraging information from all pairs of samples.
  • The proposed private estimation algorithm achieves improved, near-optimal sample complexity under local differential privacy, outperforming prior methods especially for small privacy budgets.
  • A new sequential testing algorithm adapts to unknown problem difficulty and demonstrates near-optimal sample complexity for distinguishing collision probabilities, also extendable to a private version.

This paper presents near-optimal algorithms for two problems related to the collision probability $C(p) = \sum_i p_i^2$ of a discrete distribution $p$: private estimation under local differential privacy (LDP) and sequential hypothesis testing. Collision probability is a fundamental measure of distribution spread with applications in ecology (Simpson index), economics (Herfindahl–Hirschman index), databases (join size estimation), and statistics (related to Rényi entropy). A key advantage of the proposed algorithms is their efficient use of data: they examine $\Theta(n^2)$ pairs of samples to estimate collisions, unlike previous methods that use only $O(n)$ pairs, which leads to improved sample complexity.
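
As a quick, self-contained illustration (not taken from the paper), the snippet below compares the true collision probability of a small example distribution with the unbiased all-pairs estimate computed from i.i.d. samples; the distribution, sample size, and seed are arbitrary choices for this sketch.

```python
import numpy as np

def collision_probability(p):
    """True collision probability C(p) = sum_i p_i^2."""
    p = np.asarray(p, dtype=float)
    return float(np.sum(p ** 2))

def all_pairs_estimate(samples):
    """Unbiased estimate of C(p): the fraction of the n*(n-1)/2 sample pairs that collide.
    Collisions are counted per symbol rather than by looping over all pairs explicitly."""
    samples = np.asarray(samples)
    n = len(samples)
    _, counts = np.unique(samples, return_counts=True)
    collisions = np.sum(counts * (counts - 1)) / 2
    return 2.0 * collisions / (n * (n - 1))

rng = np.random.default_rng(0)
p = np.array([0.5, 0.25, 0.125, 0.125])        # example distribution, C(p) = 0.34375
samples = rng.choice(len(p), size=5000, p=p)   # n = 5000 i.i.d. samples
print(collision_probability(p), all_pairs_estimate(samples))
```

Mechanism 1 and Algorithm 2 below can be read as private and sequential refinements, respectively, of this all-pairs estimator.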

Private Estimation of Collision Probability

  • Problem: Estimate $C(p)$ with additive error $\epsilon$ and confidence $1-\delta$, while satisfying $(\alpha, \beta)$-local differential privacy (LDP). LDP ensures privacy even if the central server is untrusted.
  • Proposed Mechanism (Mechanism 1):

1. A central server coordinates $N$ users, partitioned into $g$ groups. The number of users $N_j$ in group $j$ follows a Poisson distribution with mean $m = n/g$, where $n$ is the total expected sample size.
2. The server sends a random hash function $h: \{0, 1\}^* \mapsto \{-1, +1\}$ to all users.
3. Each user $i$ draws a sample $x_i$ from $p$, chooses a random salt $s_i$ from $\{1, \ldots, r\}$ (where $r$ depends on $\alpha, \beta$), and sends the single hash bit $v_i = h(\langle j_i, s_i, x_i \rangle)$ to the server, where $j_i$ is the user's group index. Salts enhance privacy.
4. The server computes $V_j = \sum_{i \in I_j} v_i$ for each group $j$.
5. An initial estimate for each group is calculated as $C_j = \frac{r(V_j^2 - m)}{m^2}$. This debiases the squared sum of hash values.
6. To improve robustness, the groups are partitioned into supergroups, the estimates within each supergroup are averaged ($C_\ell$), and the final estimate $C$ is the median of these supergroup averages.
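
A minimal simulation sketch of this mechanism follows; it is not the paper's implementation. The group count $g$, salt range $r$, and number of supergroups are free parameters here (the paper derives them from $\alpha$, $\beta$, $\epsilon$, and $\delta$), and a table of random signs stands in for the shared hash function.

```python
import numpy as np

def mechanism1_sketch(sample_fn, n, g=200, r=8, supergroups=10, rng=None):
    """Rough simulation of the salted-hash LDP estimator described above.
    Parameter choices (g, r, supergroups) are illustrative, not the paper's;
    sample_fn(rng) draws one sample from the unknown distribution p."""
    rng = np.random.default_rng() if rng is None else rng
    m = n / g                                    # expected number of users per group
    hash_cache = {}                              # random-sign table standing in for a shared hash h
    def h(key):
        if key not in hash_cache:
            hash_cache[key] = int(rng.choice([-1, 1]))
        return hash_cache[key]

    group_estimates = []
    for j in range(g):
        N_j = rng.poisson(m)                     # group size is Poisson with mean m
        V_j = 0
        for _ in range(N_j):
            x = sample_fn(rng)                   # user's private sample
            s = int(rng.integers(1, r + 1))      # user's private random salt
            V_j += h((j, s, x))                  # the single bit the user reports
        group_estimates.append(r * (V_j ** 2 - m) / m ** 2)   # debiased per-group estimate

    # Median of supergroup means, for robustness against outlier groups.
    chunks = np.array_split(np.array(group_estimates), supergroups)
    return float(np.median([chunk.mean() for chunk in chunks]))

rng = np.random.default_rng(1)
p = np.array([0.5, 0.25, 0.125, 0.125])          # C(p) = 0.34375
estimate = mechanism1_sketch(lambda rg: int(rg.choice(len(p), p=p)), n=400_000, rng=rng)
print(estimate)
```

With these illustrative parameters the output typically lands within a few hundredths of $C(p) = 0.34375$; the noise is dominated by the per-group fluctuation of $V_j^2$, which the supergroup averaging and median are there to control.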

  • Guarantees:
    • Privacy: Mechanism 1 satisfies $(\alpha, \beta)$-LDP (Theorem 1).
    • Sample Complexity: For $\alpha \le 1$, it achieves additive error $\epsilon$ with probability $1-\delta$ using $n = \tilde{O}\left(\frac{\log(1/\beta)}{\alpha^2 \epsilon^2}\right)$ expected samples (Corollary 1). This improves upon the previous best LDP result of $\tilde{O}\left(\frac{1}{\alpha^4 \epsilon^2}\right)$ by Bravo-Hermsdorff et al. (2023).
    • Near-Optimality: The sample complexity is shown to be near-optimal for small $\alpha$ via a lower bound (Theorem 3).
  • Comparison: Unlike prior LDP work, this mechanism uses privately chosen salts and collision information from all $\Theta(n^2)$ pairs (implicitly, via $V_j^2$), leading to better dependence on $\alpha$. It also avoids the support-size dependency of earlier work \citep{butucea2021locally}.
  • Theoretical Separation: The paper shows that the alternative approach of privately estimating the distribution $p$ first (using an optimal LDP estimator) and then computing the collision probability of the estimate can lead to suboptimal sample complexity that depends on the support size $k$ (Theorem 4), unlike the proposed direct method.

Sequential Testing of Collision Probability

  • Problem: Decide between the null hypothesis $H_0: C(p) = c_0$ and the alternative $H_1: |C(p) - c_0| \ge \epsilon > 0$, given $c_0$ but without knowing $\epsilon$. The algorithm draws samples sequentially and stops when confident.
  • Proposed Algorithm (Algorithm 2):

1. Draw samples $x_1, x_2, \ldots$ one by one.
2. Maintain a running statistic based on observed collisions. At step $i$, compute $T_i = \sum_{j=1}^{i-1} \mathbf{1}\{x_i = x_j\} - 2(i - 1)c_0$.
3. Compute the centered U-statistic based on the $T$'s: $U'_i = \frac{2}{i(i-1)}\sum_{j=1}^i T_j$. This is an estimate of $C(p) - c_0$.
4. Compare $|U'_i|$ to a time-varying threshold $\phi(i, \delta) = O\left(\sqrt{(\log \log i + \log(1/\delta))/i}\right)$.
5. Reject $H_0$ if $|U'_i| > \phi(i, \delta)$. Otherwise, continue sampling.
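
The following is a minimal, non-private sketch of this stopping rule. The statistic is written directly as the all-pairs collision estimate minus $c_0$, and the threshold constant and exact correction terms in $\phi(i, \delta)$ are illustrative placeholders rather than the paper's values.

```python
import math
import numpy as np

def sequential_collision_test(sample_fn, c0, delta=0.05, max_n=1_000_000,
                              thresh_const=3.0, rng=None):
    """Sketch of a sequential test for H0: C(p) = c0 in the spirit of Algorithm 2.
    Collisions over all pairs seen so far are maintained incrementally, so each
    new sample costs O(1) amortized work. sample_fn(rng) draws one sample from p."""
    rng = np.random.default_rng() if rng is None else rng
    counts = {}            # how many times each symbol has been seen
    collisions = 0         # number of colliding pairs among samples so far
    for i in range(1, max_n + 1):
        x = sample_fn(rng)
        collisions += counts.get(x, 0)           # new sample collides with every earlier copy of x
        counts[x] = counts.get(x, 0) + 1
        if i < 2:
            continue
        u = 2.0 * collisions / (i * (i - 1)) - c0    # U-statistic estimate of C(p) - c0
        # Time-varying threshold ~ sqrt((log log i + log(1/delta)) / i); constant is illustrative.
        phi = thresh_const * math.sqrt((math.log(math.log(max(i, 3))) + math.log(1.0 / delta)) / i)
        if abs(u) > phi:
            return "reject H0", i                # confident that C(p) differs from c0
    return "fail to reject H0", max_n            # sampling budget exhausted

rng = np.random.default_rng(2)
p = np.array([0.5, 0.25, 0.125, 0.125])          # C(p) = 0.34375
decision, n_used = sequential_collision_test(lambda rg: int(rg.choice(len(p), p=p)),
                                             c0=0.25, rng=rng)
print(decision, n_used)
```

Because the statistic reuses every pair among the samples drawn so far, nothing is discarded and no value of $\epsilon$ has to be specified in advance; the test simply stops once the deviation exceeds the shrinking threshold.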

  • Guarantees:
    • Correctness: If $H_0$ is true, the algorithm rejects with probability at most $\delta$. If $H_1$ is true, it rejects with probability at least $1-\delta$.
    • Sample Complexity: If $H_1$ holds with separation $\epsilon$, the algorithm stops after $N = \tilde{O}\left(\frac{1}{\epsilon^2}\right)$ samples with high probability (Theorem 4). The $\log\log(1/\epsilon)$ factor is negligible in practice. This is near-optimal (Theorem 5).
  • Comparison: Outperforms prior sequential testing work \citep{das_2017}, which uses a "doubling trick" and requires partitioning samples into pairs (implicitly using only $O(n)$ pairs). Algorithm 2 uses all $\Theta(n^2)$ pairs and avoids discarding information or partitioning samples, leading to better empirical performance, especially when $\epsilon$ is small. It also adapts automatically to the unknown $\epsilon$.

Private Sequential Testing

  • The hashing technique from Mechanism 1 can be combined with the sequential testing logic of Algorithm 2 to create a Private Sequential Tester (PSQ).
  • PSQ (Algorithm 3, Appendix) applies the salted hashing to the inputs of the sequential tester. It adjusts the null hypothesis value $c_0$ and the threshold to account for the bias and scaling introduced by hashing (a rough sketch is given after this list).
  • Guarantees: PSQ achieves $(\alpha, \beta)$-LDP. If $|C(p) - c_0| \ge \epsilon$ and $\alpha \le 1$, it stops after $N = O\left(\frac{\log \log (1/\epsilon)}{\epsilon^2} \cdot \frac{\log^2 (1/\beta) \log (1/\delta)}{\alpha^2}\right)$ samples (Theorem 7).
  • Comparison: PSQ generally requires fewer samples than a simpler "Doubling Tester" approach (Theorem 6), which repeatedly runs Mechanism 1 with increasing sample sizes, especially when $\epsilon$ is small (i.e., when $\log(1/\epsilon)$ is large compared to $\log(1/\beta)$).
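
Below is a structural sketch of such a private sequential tester. The specific adjustment used here, comparing the empirical agreement rate of salted hash bits against the null value $1/2 + c_0/(2r)$, is an illustrative reconstruction of "adjusting $c_0$ and the threshold" and not the paper's Algorithm 3; the choice of $r$ from $(\alpha, \beta)$ is omitted.

```python
import math
import numpy as np

def private_sequential_test_sketch(sample_fn, c0, r=8, delta=0.05, max_n=2_000_000,
                                   thresh_const=3.0, rng=None):
    """Structural sketch of a salted-hash private sequential tester.
    Each user reports only one hash bit; the tester compares the empirical agreement
    rate of the bits against the adjusted null value 1/2 + c0/(2r). This adjustment is
    an illustrative reconstruction, not the paper's Algorithm 3, and the choice of r
    from (alpha, beta) is omitted."""
    rng = np.random.default_rng() if rng is None else rng
    hash_cache = {}                               # random-sign table standing in for a shared hash h
    def h(key):
        if key not in hash_cache:
            hash_cache[key] = int(rng.choice([-1, 1]))
        return hash_cache[key]

    c0_adj = 0.5 + c0 / (2 * r)                   # null agreement rate of the hashed bits
    bit_counts = {-1: 0, 1: 0}
    agreements = 0                                # agreeing pairs among bits seen so far
    for i in range(1, max_n + 1):
        x = sample_fn(rng)                        # user's private sample
        s = int(rng.integers(1, r + 1))           # user's private random salt
        v = h((s, x))                             # the single bit the user reports
        agreements += bit_counts[v]               # new bit agrees with all earlier equal bits
        bit_counts[v] += 1
        if i < 2:
            continue
        u = 2.0 * agreements / (i * (i - 1)) - c0_adj    # estimates (C(p) - c0) / (2r)
        phi = thresh_const * math.sqrt((math.log(math.log(max(i, 3))) + math.log(1.0 / delta)) / i)
        if abs(u) > phi:
            return "reject H0", i
    return "fail to reject H0", max_n
```

In this sketch the hashing shrinks the effective separation by roughly a factor of $2r$, so the stopping time grows by roughly $(2r)^2$ relative to the non-private tester, loosely mirroring the extra $\alpha$- and $\beta$-dependent factors in Theorem 7.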

Experiments

  • Private Estimation: Mechanism 1 shows significantly lower error than the mechanism of Bravo-Hermsdorff et al. (2023), especially for small $\alpha$. It also outperforms the indirect method of estimating the distribution first.
  • Sequential Testing: Algorithm 2 requires substantially fewer samples than the adapted version of the tester of \citet{das_2017}, particularly when the true collision probability is close to the null hypothesis value $c_0$. It also performs as well as or better than batch U-statistic and plug-in testers.
  • Private Sequential Testing: PSQ requires fewer samples than the Doubling Tester baseline, though both require significantly more samples (3x-17x factor for PSQ) than the non-private sequential tester.

In summary, the paper introduces near-optimal and practically efficient algorithms for private estimation and sequential testing of collision probability by leveraging information from all pairs of samples. The private estimator improves state-of-the-art sample complexity, while the sequential tester adapts to unknown problem difficulty and outperforms existing sequential and batch methods.
