Near-optimal algorithms for private estimation and sequential testing of collision probability
(2504.13804v1)
Published 18 Apr 2025 in stat.ML, cs.AI, and cs.LG
Abstract: We present new algorithms for estimating and testing \emph{collision probability}, a fundamental measure of the spread of a discrete distribution that is widely used in many scientific fields. We describe an algorithm that satisfies $(\alpha, \beta)$-local differential privacy and estimates collision probability with error at most $\epsilon$ using $\tilde{O}\left(\frac{\log(1/\beta)}{\alpha^2 \epsilon^2}\right)$ samples for $\alpha \le 1$, which improves over previous work by a factor of $\frac{1}{\alpha^2}$. We also present a sequential testing algorithm for collision probability, which can distinguish between collision probability values that are separated by $\epsilon$ using $\tilde{O}(\frac{1}{\epsilon^2})$ samples, even when $\epsilon$ is unknown. Our algorithms have nearly the optimal sample complexity, and in experiments we show that they require significantly fewer samples than previous methods.
Summary
The paper introduces near-optimal algorithms for the private estimation and sequential testing of discrete distribution collision probability, leveraging information from all pairs of samples.
The proposed private estimation algorithm achieves improved, near-optimal sample complexity under local differential privacy, outperforming prior methods especially for small privacy budgets.
A new sequential testing algorithm adapts to unknown problem difficulty and demonstrates near-optimal sample complexity for distinguishing collision probabilities, also extendable to a private version.
This paper presents near-optimal algorithms for two problems related to the collision probability C(p) = ∑_i p_i² of a discrete distribution p: private estimation under local differential privacy (LDP) and sequential hypothesis testing. Collision probability is a fundamental measure of distribution spread with applications in ecology (Simpson index), economics (Herfindahl–Hirschman index), databases (join size estimation), and statistics (related to Rényi entropy). A key advantage of the proposed algorithms is their efficient use of data: they examine Θ(n²) pairs of samples to estimate collisions, unlike previous methods that use only O(n) pairs, leading to improved sample complexity.
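For concreteness, here is a minimal sketch (function names are mine, not the paper's) of the quantity being estimated and of the all-pairs collision estimator that both algorithms build on:

```python
import itertools

def collision_probability(p):
    """C(p) = sum_i p_i^2 for a distribution given as a probability vector."""
    return sum(pi * pi for pi in p)

def all_pairs_estimate(samples):
    """Unbiased estimate of C(p): the fraction of the n(n-1)/2 sample pairs that collide."""
    n = len(samples)
    collisions = sum(1 for x, y in itertools.combinations(samples, 2) if x == y)
    return collisions / (n * (n - 1) / 2)

print(collision_probability([0.25, 0.25, 0.25, 0.25]))  # 0.25 (uniform over 4 symbols)
```

Each pair collides with probability exactly C(p), which is why using all Θ(n²) pairs, rather than O(n) disjoint ones, extracts more information per sample.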
Private Estimation of Collision Probability
Problem: Estimate C(p) with additive error ϵ and confidence 1−δ, while satisfying (α,β)-local differential privacy (LDP). LDP ensures privacy even if the central server is untrusted.
Proposed Mechanism (Mechanism 1):
1. A central server coordinates N users, partitioned into g groups. The number of users Nj in group j follows a Poisson distribution with mean m=n/g, where n is the total expected sample size.
2. The server sends a random hash function h: {0,1}* → {−1,+1} to all users.
3. Each user i draws a sample xi from p, chooses a random salt si from {1,…,r} (where r depends on α,β), and sends the single hash bit vi=h(⟨ji,si,xi⟩) to the server, where ji is the user's group index. Salts enhance privacy.
4. The server computes V_j = ∑_{i∈I_j} v_i for each group j, where I_j is the set of users in group j.
5. An initial estimate for each group is calculated as C_j = (r/m²)(V_j² − m). This debiases the squared sum of hash values.
6. To improve robustness, the groups are partitioned into supergroups, the estimates within each supergroup are averaged (yielding C̄_ℓ), and the final estimate is the median of these supergroup averages.
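The six steps above can be sketched in simulation as follows. This is a minimal sketch under assumptions of my own: fixed group sizes m rather than the paper's Poisson(m) sizes, supergroups of 10 groups, and a memoized random-sign dictionary standing in for the shared hash function.

```python
import random
import statistics

def mechanism1_sketch(groups, r, seed=0):
    """Simulation sketch of the salted-hash collision estimator (Mechanism 1).
    groups: list of g lists of user samples (fixed sizes here, not Poisson).
    r: number of possible salts."""
    rng = random.Random(seed)
    signs = {}  # memoized stand-in for the shared random hash h: key -> {-1, +1}

    def h(key):
        if key not in signs:
            signs[key] = rng.choice((-1, 1))
        return signs[key]

    per_group = []
    for j, group in enumerate(groups):
        m = len(group)
        # Each user reports the single bit h((group index, private salt, sample)).
        V = sum(h((j, rng.randrange(r), x)) for x in group)
        # Step 5: E[V^2] is roughly m + (m^2 / r) * C(p), so invert for C(p).
        per_group.append(r * (V * V - m) / (m * m))

    # Step 6: median of supergroup means (supergroups of 10) for robustness.
    means = [statistics.fmean(per_group[i:i + 10]) for i in range(0, len(per_group), 10)]
    return statistics.median(means)

# Uniform distribution over 4 symbols has C(p) = 1/4.
data_rng = random.Random(1)
groups = [[data_rng.randrange(4) for _ in range(50)] for _ in range(200)]
est = mechanism1_sketch(groups, r=2)
```

With 200 groups of 50 samples each, the median-of-means estimate concentrates near the true value 0.25; the per-group variance is substantial, which is exactly why the supergroup averaging and median step exist.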
Sample Complexity: For α ≤ 1, it achieves additive error ϵ with probability 1−δ using n = Õ(log(1/β)/(α²ϵ²)) expected samples (Corollary 1). This improves upon the previous best LDP result of Õ(1/(α⁴ϵ²)) by \citet{BravoHermsdorff2022} (Bravo-Hermsdorff et al., 2023).
Near-Optimality: The sample complexity is shown to be near-optimal for small α via a lower bound (Theorem 3).
Comparison: Unlike prior LDP work, this mechanism uses privately chosen salts and collision information from all Θ(n²) pairs (implicitly via V_j²), leading to better dependence on α. It avoids the support size dependency of earlier work \citep{butucea2021locally} (Manjegani et al., 2021).
Theoretical Separation: The paper shows that the alternative approach of privately estimating the distribution p first (using an optimal LDP estimator) and then computing the collision probability of the estimate can lead to suboptimal sample complexity that depends on the support size k (Theorem 4), unlike the proposed direct method.
Sequential Testing of Collision Probability
Problem: Decide between the null hypothesis H0: C(p) = c0 and the alternative H1: |C(p) − c0| ≥ ϵ > 0, given c0 but without knowing ϵ. The algorithm draws samples sequentially and stops when confident.
Proposed Algorithm (Algorithm 2):
1. Draw samples x1,x2,… one by one.
2. Maintain a running statistic based on observed collisions. At step i, compute T_i = ∑_{j=1}^{i−1} 1{x_i = x_j} − (i−1)c₀, the number of collisions between x_i and all previous samples minus its expectation under H0.
3. Compute the centered U-statistic U′_i = (2/(i(i−1))) ∑_{j=1}^{i} T_j. This is an estimate of C(p) − c₀.
4. Compare |U′_i| to a time-varying threshold ϕ(i,δ) = O(√((log log i + log(1/δ))/i)).
5. Reject H0 if ∣Ui′∣>ϕ(i,δ). Otherwise, continue sampling.
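The loop above can be sketched as follows. This is a minimal sketch: the constant 4.0 and the exact form of the threshold ϕ are illustrative assumptions, not the paper's tuned values. A running histogram of past samples makes each step O(1) while still using all pairs seen so far.

```python
import math
import random

def sequential_collision_test(stream, c0, delta=0.05, max_n=100_000):
    """Sketch of the sequential collision-probability tester (Algorithm 2)."""
    counts = {}   # histogram of samples seen so far
    t_sum = 0.0   # running sum of T_i
    n = 0
    for x in stream:
        # T_i = (# collisions of x_i with x_1..x_{i-1}) - (i-1) * c0
        t_sum += counts.get(x, 0) - n * c0
        counts[x] = counts.get(x, 0) + 1
        n += 1
        if n < 2:
            continue
        u = 2.0 * t_sum / (n * (n - 1))  # estimate of C(p) - c0
        phi = math.sqrt(4.0 * (math.log(math.log(max(n, 3))) + math.log(1 / delta)) / n)
        if abs(u) > phi:
            return "reject", n
        if n >= max_n:
            break
    return "fail_to_reject", n

# Fair coin flips have C(p) = 0.5, far from the null value c0 = 0.1.
rng = random.Random(0)
decision, n = sequential_collision_test((rng.randrange(2) for _ in range(100_000)), c0=0.1)
```

Under H1 with a large gap the test stops after a modest number of samples; under H0 the iterated-logarithm threshold is what keeps the false-rejection probability near δ (for suitably tuned constants).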
Guarantees:
Correctness: If H0 is true, the algorithm rejects with probability at most δ. If H1 is true, it rejects with probability at least 1−δ.
Sample Complexity: If H1 holds with separation ϵ, the algorithm stops after N = Õ(1/ϵ²) samples with high probability (Theorem 4). The loglog(1/ϵ) factor is negligible in practice. This is near-optimal (Theorem 5).
Comparison: Outperforms prior sequential testing work \citep{das_2017} which uses a "doubling trick" and requires partitioning samples into pairs (using only O(n) pairs implicitly). Algorithm 2 uses all Θ(n²) pairs and avoids discarding information or requiring sample partitioning, leading to better empirical performance, especially when ϵ is small. It also adapts automatically to the unknown ϵ.
Private Sequential Testing
The hashing technique from Mechanism 1 can be combined with the sequential testing logic of Algorithm 2 to create a Private Sequential Tester (PSQ).
PSQ (Algorithm 3, Appendix) applies the salted hashing to the inputs of the sequential tester. It adjusts the null hypothesis value c0 and the threshold to account for the bias and scaling introduced by hashing.
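A sketch of that adjustment, derived here from the hashing scheme described above rather than taken from the paper: two reported bits agree for sure when their hash keys match, which happens with probability C(p)/r (same sample and same salt), and otherwise agree with probability 1/2. So the null value c0 maps to 1/2 + c0/(2r), and a separation of ϵ in C(p) shrinks to ϵ/(2r) in bit space:

```python
def hashed_null(c0, r):
    """Pair-agreement probability of the reported bits under H0 (assumed derivation:
    sure agreement on a key match with prob. c0/r, else agreement with prob. 1/2)."""
    return c0 / r + (1 - c0 / r) / 2  # = 1/2 + c0/(2r)

def hashed_separation(eps, r):
    """An epsilon gap in C(p) becomes an eps/(2r) gap between bit agreement rates."""
    return eps / (2 * r)
```

The 1/(2r) shrinkage of the separation is consistent with the extra α- and β-dependent factors in PSQ's sample complexity below.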
Guarantees: PSQ achieves (α,β)-LDP. If |C(p) − c0| ≥ ϵ and α ≤ 1, it stops after N = O(loglog(1/ϵ) · log²(1/β) · log(1/δ) / (α²ϵ²)) samples (Theorem 7).
Comparison: PSQ generally requires fewer samples than a simpler "Doubling Tester" approach (Theorem 6) which repeatedly runs Mechanism 1 with increasing sample sizes, especially when ϵ is small (i.e., log(1/ϵ) is large compared to log(1/β)).
Experiments
Private Estimation: Mechanism 1 shows significantly lower error than the mechanism from \citet{BravoHermsdorff2022} (Bravo-Hermsdorff et al., 2023), especially for small α. It also outperforms the indirect method of estimating the distribution first.
Sequential Testing: Algorithm 2 requires substantially fewer samples than the adapted version of \citet{das_2017}'s tester, particularly when the true collision probability is close to the null hypothesis (c0). It also performs as well as or better than batch U-statistic and plug-in testers.
Private Sequential Testing: PSQ requires fewer samples than the Doubling Tester baseline, though both require significantly more samples than the non-private sequential tester (a 3x–17x factor for PSQ).
In summary, the paper introduces near-optimal and practically efficient algorithms for private estimation and sequential testing of collision probability by leveraging information from all pairs of samples. The private estimator improves state-of-the-art sample complexity, while the sequential tester adapts to unknown problem difficulty and outperforms existing sequential and batch methods.