Near-optimal algorithms for private estimation and sequential testing of collision probability
(2504.13804v1)
Published 18 Apr 2025 in stat.ML, cs.AI, and cs.LG
Abstract: We present new algorithms for estimating and testing \emph{collision probability}, a fundamental measure of the spread of a discrete distribution that is widely used in many scientific fields. We describe an algorithm that satisfies $(\alpha, \beta)$-local differential privacy and estimates collision probability with error at most $\epsilon$ using $\tilde{O}\left(\frac{\log(1/\beta)}{\alpha^2 \epsilon^2}\right)$ samples for $\alpha \le 1$, which improves over previous work by a factor of $\frac{1}{\alpha^2}$. We also present a sequential testing algorithm for collision probability, which can distinguish between collision probability values that are separated by $\epsilon$ using $\tilde{O}(\frac{1}{\epsilon^2})$ samples, even when $\epsilon$ is unknown. Our algorithms have nearly the optimal sample complexity, and in experiments we show that they require significantly fewer samples than previous methods.
Summary
The paper introduces near-optimal algorithms for the private estimation and sequential testing of discrete distribution collision probability, leveraging information from all pairs of samples.
The proposed private estimation algorithm achieves improved, near-optimal sample complexity under local differential privacy, outperforming prior methods especially for small privacy budgets.
A new sequential testing algorithm adapts to unknown problem difficulty and demonstrates near-optimal sample complexity for distinguishing collision probabilities, also extendable to a private version.
This paper presents near-optimal algorithms for two problems related to the collision probability C(p) = ∑_i p_i² of a discrete distribution p: private estimation under local differential privacy (LDP) and sequential hypothesis testing. Collision probability is a fundamental measure of distribution spread with applications in ecology (Simpson index), economics (Herfindahl–Hirschman index), databases (join size estimation), and statistics (related to Rényi entropy). A key advantage of the proposed algorithms is their efficient use of data: they examine Θ(n²) pairs of samples to estimate collisions, unlike previous methods that use only O(n) pairs, leading to improved sample complexity.
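For concreteness, here is a minimal sketch (function names are mine, not the paper's) of the quantity being estimated and of the all-pairs collision estimator that both algorithms build on:

```python
import itertools

def collision_probability(p):
    """C(p) = sum_i p_i^2 for a distribution given as a probability vector."""
    return sum(pi * pi for pi in p)

def all_pairs_estimate(samples):
    """Unbiased estimate of C(p): the fraction of the n(n-1)/2 sample pairs that collide."""
    n = len(samples)
    collisions = sum(1 for x, y in itertools.combinations(samples, 2) if x == y)
    return collisions / (n * (n - 1) / 2)

print(collision_probability([0.25, 0.25, 0.25, 0.25]))  # 0.25 (uniform over 4 symbols)
```

Each pair collides with probability exactly C(p), which is why using all Θ(n²) pairs, rather than O(n) disjoint ones, extracts more information per sample.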
Private Estimation of Collision Probability
Problem: Estimate C(p) with additive error ϵ and confidence 1−δ, while satisfying (α,β)-local differential privacy (LDP). LDP ensures privacy even if the central server is untrusted.
Proposed Mechanism (Mechanism 1):
1. A central server coordinates N users, partitioned into g groups. The number of users Nj in group j follows a Poisson distribution with mean m=n/g, where n is the total expected sample size.
2. The server sends a random hash function h: {0,1}* → {−1,+1} to all users.
3. Each user i draws a sample xi from p, chooses a random salt si from {1,…,r} (where r depends on α,β), and sends the single hash bit vi=h(⟨ji,si,xi⟩) to the server, where ji is the user's group index. Salts enhance privacy.
4. The server computes V_j = ∑_{i∈I_j} v_i for each group j, where I_j is the set of users in group j.
5. An initial estimate for each group is calculated as C_j = (r/m²)(V_j² − m). This debiases the squared sum of hash values.
6. To improve robustness, the groups are partitioned into supergroups, the estimates within each supergroup are averaged (yielding C̄_ℓ), and the final estimate is the median of these supergroup averages.
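The six steps above can be sketched in simulation as follows. This is a minimal sketch under assumptions of my own: fixed group sizes m rather than the paper's Poisson(m) sizes, supergroups of 10 groups, and a memoized random-sign dictionary standing in for the shared hash function.

```python
import random
import statistics

def mechanism1_sketch(groups, r, seed=0):
    """Simulation sketch of the salted-hash collision estimator (Mechanism 1).
    groups: list of g lists of user samples (fixed sizes here, not Poisson).
    r: number of possible salts."""
    rng = random.Random(seed)
    signs = {}  # memoized stand-in for the shared random hash h: key -> {-1, +1}

    def h(key):
        if key not in signs:
            signs[key] = rng.choice((-1, 1))
        return signs[key]

    per_group = []
    for j, group in enumerate(groups):
        m = len(group)
        # Each user reports the single bit h((group index, private salt, sample)).
        V = sum(h((j, rng.randrange(r), x)) for x in group)
        # Step 5: E[V^2] is roughly m + (m^2 / r) * C(p), so invert for C(p).
        per_group.append(r * (V * V - m) / (m * m))

    # Step 6: median of supergroup means (supergroups of 10) for robustness.
    means = [statistics.fmean(per_group[i:i + 10]) for i in range(0, len(per_group), 10)]
    return statistics.median(means)

# Uniform distribution over 4 symbols has C(p) = 1/4.
data_rng = random.Random(1)
groups = [[data_rng.randrange(4) for _ in range(50)] for _ in range(200)]
est = mechanism1_sketch(groups, r=2)
```

With 200 groups of 50 samples each, the median-of-means estimate concentrates near the true value 0.25; the per-group variance is substantial, which is exactly why the supergroup averaging and median step exist.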
Sample Complexity: For α ≤ 1, it achieves additive error ϵ with probability 1−δ using n = Õ(log(1/β)/(α²ϵ²)) expected samples (Corollary 1). This improves upon the previous best LDP result of Õ(1/(α⁴ϵ²)) by \citet{BravoHermsdorff2022} (Bravo-Hermsdorff et al., 2023).
Near-Optimality: The sample complexity is shown to be near-optimal for small α via a lower bound (Theorem 3).
Comparison: Unlike prior LDP work, this mechanism uses privately chosen salts and collision information from all Θ(n²) pairs (implicitly via V_j²), leading to better dependence on α. It avoids the support size dependency of earlier work \citep{butucea2021locally} (Manjegani et al., 2021).
Theoretical Separation: The paper shows that the alternative approach of privately estimating the distribution p first (using an optimal LDP estimator) and then computing the collision probability of the estimate can lead to suboptimal sample complexity that depends on the support size k (Theorem 4), unlike the proposed direct method.
Sequential Testing of Collision Probability
Problem: Decide between the null hypothesis H0: C(p) = c0 and the alternative H1: |C(p) − c0| ≥ ϵ > 0, given c0 but without knowing ϵ. The algorithm draws samples sequentially and stops when confident.
Proposed Algorithm (Algorithm 2):
1. Draw samples x1,x2,… one by one.
2. Maintain a running statistic based on observed collisions. At step i, compute T_i = ∑_{j=1}^{i−1} 1{x_i = x_j} − (i−1)c₀, the number of collisions between x_i and all previous samples minus its expectation under H0.
3. Compute the centered U-statistic U′_i = (2/(i(i−1))) ∑_{j=1}^{i} T_j. This is an estimate of C(p) − c₀.
4. Compare |U′_i| to a time-varying threshold ϕ(i,δ) = O(√((log log i + log(1/δ))/i)).
5. Reject H0 if ∣Ui′∣>ϕ(i,δ). Otherwise, continue sampling.
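The loop above can be sketched as follows. This is a minimal sketch: the constant 4.0 and the exact form of the threshold ϕ are illustrative assumptions, not the paper's tuned values. A running histogram of past samples makes each step O(1) while still using all pairs seen so far.

```python
import math
import random

def sequential_collision_test(stream, c0, delta=0.05, max_n=100_000):
    """Sketch of the sequential collision-probability tester (Algorithm 2)."""
    counts = {}   # histogram of samples seen so far
    t_sum = 0.0   # running sum of T_i
    n = 0
    for x in stream:
        # T_i = (# collisions of x_i with x_1..x_{i-1}) - (i-1) * c0
        t_sum += counts.get(x, 0) - n * c0
        counts[x] = counts.get(x, 0) + 1
        n += 1
        if n < 2:
            continue
        u = 2.0 * t_sum / (n * (n - 1))  # estimate of C(p) - c0
        phi = math.sqrt(4.0 * (math.log(math.log(max(n, 3))) + math.log(1 / delta)) / n)
        if abs(u) > phi:
            return "reject", n
        if n >= max_n:
            break
    return "fail_to_reject", n

# Fair coin flips have C(p) = 0.5, far from the null value c0 = 0.1.
rng = random.Random(0)
decision, n = sequential_collision_test((rng.randrange(2) for _ in range(100_000)), c0=0.1)
```

Under H1 with a large gap the test stops after a modest number of samples; under H0 the iterated-logarithm threshold is what keeps the false-rejection probability near δ (for suitably tuned constants).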
Guarantees:
Correctness: If H0 is true, the algorithm rejects with probability at most δ. If H1 is true, it rejects with probability at least 1−δ.
Sample Complexity: If H1 holds with separation ϵ, the algorithm stops after N = Õ(1/ϵ²) samples with high probability (Theorem 4). The loglog(1/ϵ) factor is negligible in practice. This is near-optimal (Theorem 5).
Comparison: Outperforms prior sequential testing work \citep{das_2017} which uses a "doubling trick" and requires partitioning samples into pairs (using only O(n) pairs implicitly). Algorithm 2 uses all Θ(n²) pairs and avoids discarding information or requiring sample partitioning, leading to better empirical performance, especially when ϵ is small. It also adapts automatically to the unknown ϵ.
Private Sequential Testing
The hashing technique from Mechanism 1 can be combined with the sequential testing logic of Algorithm 2 to create a Private Sequential Tester (PSQ).
PSQ (Algorithm 3, Appendix) applies the salted hashing to the inputs of the sequential tester. It adjusts the null hypothesis value c0 and the threshold to account for the bias and scaling introduced by hashing.
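A sketch of that adjustment, derived here from the hashing scheme described above rather than taken from the paper: two reported bits agree for sure when their hash keys match, which happens with probability C(p)/r (same sample and same salt), and otherwise agree with probability 1/2. So the null value c0 maps to 1/2 + c0/(2r), and a separation of ϵ in C(p) shrinks to ϵ/(2r) in bit space:

```python
def hashed_null(c0, r):
    """Pair-agreement probability of the reported bits under H0 (assumed derivation:
    sure agreement on a key match with prob. c0/r, else agreement with prob. 1/2)."""
    return c0 / r + (1 - c0 / r) / 2  # = 1/2 + c0/(2r)

def hashed_separation(eps, r):
    """An epsilon gap in C(p) becomes an eps/(2r) gap between bit agreement rates."""
    return eps / (2 * r)
```

The 1/(2r) shrinkage of the separation is consistent with the extra α- and β-dependent factors in PSQ's sample complexity below.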
Guarantees: PSQ achieves (α,β)-LDP. If |C(p) − c0| ≥ ϵ and α ≤ 1, it stops after N = O(loglog(1/ϵ) · log²(1/β) · log(1/δ) / (α²ϵ²)) samples (Theorem 7).
Comparison: PSQ generally requires fewer samples than a simpler "Doubling Tester" approach (Theorem 6) which repeatedly runs Mechanism 1 with increasing sample sizes, especially when ϵ is small (i.e., log(1/ϵ) is large compared to log(1/β)).
Experiments
Private Estimation: Mechanism 1 shows significantly lower error than the mechanism from \citet{BravoHermsdorff2022} (Bravo-Hermsdorff et al., 2023), especially for small α. It also outperforms the indirect method of estimating the distribution first.
Sequential Testing: Algorithm 2 requires substantially fewer samples than the adapted version of \citet{das_2017}'s tester, particularly when the true collision probability is close to the null hypothesis (c0). It also performs as well as or better than batch U-statistic and plug-in testers.
Private Sequential Testing: PSQ requires fewer samples than the Doubling Tester baseline, though both require significantly more samples than the non-private sequential tester (a 3x–17x factor for PSQ).
In summary, the paper introduces near-optimal and practically efficient algorithms for private estimation and sequential testing of collision probability by leveraging information from all pairs of samples. The private estimator improves state-of-the-art sample complexity, while the sequential tester adapts to unknown problem difficulty and outperforms existing sequential and batch methods.