Differentially Private Frequency Oracle

Updated 26 December 2025

A differentially private frequency oracle estimates histogram counts while providing rigorous ε-LDP guarantees, using mechanisms such as RR, HRR, and OLH.
These mechanisms build on randomized response and Hadamard transforms to achieve optimal error bounds and near-optimal MSE.
Practical implementations benefit from consistency post-processing and parameter tuning; extensions to FLDP and streaming data balance utility and privacy.

A differentially private frequency oracle is an algorithmic framework for estimating the frequency (histogram) of discrete values in a population while providing rigorous differential privacy guarantees for each user's data. For any value $v$ in the input domain, the oracle outputs an estimate $\hat{f}_U[v]$ of $f_U[v] = |\{u : v^{(u)} = v\}|$ such that, with high probability over the mechanism's internal randomness, the deviation $|\hat{f}_U[v] - f_U[v]|$ is bounded. The main setting is the local differential privacy (LDP) model, in which data is privatized on the user's device before transmission; the centralized and shuffle models are also considered.

1. Problem Definition and Formal Guarantees

A frequency oracle under $\varepsilon$-local differential privacy (LDP) operates on a population of users $U = \{1, \dots, n\}$, where each user $u$ holds a private value $v^{(u)} \in D$ from a domain $D$ of size $|D| = d$. An LDP mechanism $A: D \to Y$ must satisfy, for all $v,v' \in D$ and measurable $S \subseteq Y$,

$\Pr[A(v) \in S] \leq e^{\varepsilon} \Pr[A(v') \in S].$
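The inequality can be checked numerically for a concrete mechanism. The sketch below is illustrative only: it assumes generalized randomized response with the standard probabilities $p = e^{\varepsilon}/(e^{\varepsilon}+d-1)$ and $q = 1/(e^{\varepsilon}+d-1)$, and the helper names `grr_matrix` and `max_privacy_ratio` are hypothetical. For a finite output set it is enough to check single outputs, because the bound on every singleton implies the bound on every set $S$.

```python
import math

def grr_matrix(d, eps):
    """Transition probabilities Pr[A(v) = y] of generalized randomized response."""
    p = math.exp(eps) / (math.exp(eps) + d - 1)  # report the true value
    q = 1.0 / (math.exp(eps) + d - 1)            # report one specific other value
    return [[p if y == v else q for y in range(d)] for v in range(d)]

def max_privacy_ratio(matrix):
    """Largest ratio Pr[A(v) = y] / Pr[A(v') = y] over all inputs v, v' and outputs y."""
    worst = 0.0
    for y in range(len(matrix[0])):
        column = [row[y] for row in matrix]
        worst = max(worst, max(column) / min(column))
    return worst

eps = 1.0
ratio = max_privacy_ratio(grr_matrix(5, eps))
# The mechanism is eps-LDP exactly when ratio <= e^eps.
print(ratio, math.exp(eps))
```

In this example the worst-case ratio equals $p/q = e^{\varepsilon}$, so the privacy budget is used with no slack.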

A frequency oracle must provide, for any $v$, an estimate $\hat{f}_U[v]$ such that, for error threshold $\lambda$ and failure probability $\beta$,

$\Pr[|\hat{f}_U[v] - f_U[v]| \geq \lambda] \leq \beta,$

with $\lambda$ as small as possible given $n, d, \varepsilon, \beta$ (Wu et al., 2021).

Key information-theoretic lower bounds on the achievable mean squared error (MSE) are $\Omega((e^{\varepsilon}+1)^2/[n (e^{\varepsilon}-1)^2])$ per coordinate for the LDP setting (Lopuhaä-Zwakenberg et al., 2019).
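
As a worked simplification (standard algebra, not taken from the cited paper), the expression inside the bound can be rewritten with a hyperbolic cotangent, which makes the small-$\varepsilon$ behavior explicit:

$\frac{(e^{\varepsilon}+1)^2}{n(e^{\varepsilon}-1)^2} = \frac{1}{n}\coth^2\left(\frac{\varepsilon}{2}\right) \approx \frac{4}{n\varepsilon^2} \quad \text{for small } \varepsilon.$

For example, with $\varepsilon = 1$ and $n = 10^4$ the expression evaluates to about $4.68 \times 10^{-4}$, while the approximation $4/(n\varepsilon^2)$ gives $4 \times 10^{-4}$.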

2. Core Algorithmic Techniques

Several algorithmic paradigms achieve these guarantees, using various randomization and estimation schemes.

Canonical LDP Mechanisms

Randomized Response (RR): For binary domains, perturbing each user's bit achieves the optimal MSE.
Generalized Randomized Response (GRR): Extends RR to domains of size $d > 2$.
Optimized Unary Encoding (OUE)/RAPPOR: Encodes each value as a one-hot vector and perturbs it bit by bit.
Hadamard Response (HRR)/Hadamard Transform: Maps the domain onto a Hadamard basis, randomizes the coefficients, and reconstructs frequencies via the inverse linear transform.
Optimized Local Hashing (OLH): Hashes values into a smaller domain and applies randomized response there (Lopuhaä-Zwakenberg et al., 2019).
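
As an illustration of the Hadamard-based entry in the list above, the following is a minimal sketch of one common one-bit formulation. It rests on assumptions not stated in the text: each user samples a uniformly random Hadamard row index, reports the corresponding $\pm 1$ entry, and flips it with probability $1/(e^{\varepsilon}+1)$. It also assumes $d$ is a power of two, and the function names are hypothetical. The server-side inverse here is a plain $O(d^2)$ sum; a fast Walsh-Hadamard transform would bring it down to $O(d \log d)$.

```python
import math
import random

def hadamard_entry(j, v):
    """Entry H[j][v] of the Sylvester Hadamard matrix, either +1 or -1."""
    return -1 if bin(j & v).count("1") % 2 else 1

def hrr_perturb(value, d, eps, rng):
    """User side: pick a random row j and report a noisy copy of H[j][value]."""
    j = rng.randrange(d)
    bit = hadamard_entry(j, value)
    keep = math.exp(eps) / (math.exp(eps) + 1)
    if rng.random() >= keep:
        bit = -bit  # flip the sign with probability 1 / (e^eps + 1)
    return j, bit

def hrr_estimate(reports, d, eps):
    """Server side: debias the signed reports and invert the transform."""
    n = len(reports)
    scale = (math.exp(eps) + 1) / (math.exp(eps) - 1)
    row_sums = [0] * d
    for j, bit in reports:
        row_sums[j] += bit
    return [scale / n * sum(row_sums[j] * hadamard_entry(j, v) for j in range(d))
            for v in range(d)]

rng = random.Random(7)
d, eps = 8, 2.0
values = [0] * 10000 + [3] * 6000 + [5] * 4000  # true fractions 0.5, 0.3, 0.2
reports = [hrr_perturb(v, d, eps, rng) for v in values]
est = hrr_estimate(reports, d, eps)
print([round(x, 3) for x in est])
```

The estimator is unbiased because distinct Hadamard rows are orthogonal: averaged over the random row index, the product of the entries for $x$ and $v$ is 1 when $x = v$ and 0 otherwise.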

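All of these mechanisms share an unbiased estimator of the form

$\tilde f_v = \frac{\frac{c_v}{n} - q}{p-q},$

where $c_v$ is the number of reports that support $v$, $p$ is the probability that a report supports the user's true value, and $q$ is the probability that it supports any other given value. Here $\tilde f_v$ estimates the fraction of users holding $v$; multiplying by $n$ gives an estimate of the count $f_U[v]$.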
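
The estimator can be exercised end to end with generalized randomized response. This is a minimal simulation sketch, not a production implementation: the GRR probabilities $p = e^{\varepsilon}/(e^{\varepsilon}+d-1)$ and $q = 1/(e^{\varepsilon}+d-1)$ are the standard choice, while the function names and the synthetic population are assumptions made for illustration.

```python
import math
import random

def grr_perturb(value, d, eps, rng):
    """User side: keep the true value with probability p, else pick another one."""
    p = math.exp(eps) / (math.exp(eps) + d - 1)
    if rng.random() < p:
        return value
    other = rng.randrange(d - 1)  # uniform over the d - 1 remaining values
    return other if other < value else other + 1

def grr_estimate(reports, d, eps):
    """Server side: apply (c_v / n - q) / (p - q) to every domain element."""
    n = len(reports)
    p = math.exp(eps) / (math.exp(eps) + d - 1)
    q = 1.0 / (math.exp(eps) + d - 1)
    counts = [0] * d
    for r in reports:
        counts[r] += 1  # a GRR report supports exactly the value it names
    return [((c / n) - q) / (p - q) for c in counts]

rng = random.Random(1)
d, eps = 4, 2.0
values = [0] * 10000 + [1] * 6000 + [2] * 3000 + [3] * 1000
reports = [grr_perturb(v, d, eps, rng) for v in values]
est = grr_estimate(reports, d, eps)
print([round(x, 3) for x in est])  # should be close to 0.5, 0.3, 0.15, 0.05
```

Because $p + (d-1)q = 1$, the GRR estimates always sum to one, although individual entries can be negative when the true frequency is small.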

Sketch-based and Advanced Methods

HadaOracle: A parameterized sketching scheme using a Count-Median structure. It partitions users into $k$ groups and hashes their values, and each group is processed via the base $\varepsilon$-LDP frequency oracle (HRR). The oracle outputs the median estimate from the $k$ sketches, with error $O((1/\varepsilon)\sqrt{n\ln(1/\beta)})$, which matches the lower bound (Wu et al., 2021).
Subset Selection (SS), Projective Geometry Response (PGR), and Modular Subset Selection (MSS): Advanced compression and coding approaches that use subset selection or residue number systems for composite reporting; these offer near-optimal trade-offs among utility, bandwidth, and server runtime (Feldman et al., 2022; Arcolezi, 2025).
Flexible Hadamard Response (FHR): A mechanism operating under the relaxed $(\varepsilon,\eta)$-FLDP model, where outputs need only partial overlap for privacy, further improving utility-privacy trade-offs at low privacy budgets (Zhao et al., 2022).
Joint Randomized Response (JRR): Correlates perturbations in random user pairs to reduce variance, improving MSE by up to $100\times$ over RR in some regimes while retaining $\varepsilon$-LDP (Zheng et al., 2025).
Shuffle Model Protocols: Leverage a shuffling layer to anonymize messages, allowing protocols such as the "hash-and-mix" approach to achieve error $\omega(1)\cdot O(\log n)$ with $1+o(1)$ messages per user, nearly matching central DP and substantially outperforming local-model (LDP) protocols (Luo et al., 2021).
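
The group-hash-median structure described for HadaOracle can be sketched as follows. This is a simplified illustration, not the algorithm of Wu et al.: generalized randomized response over the $m$ hash buckets stands in for the HRR base oracle, the hash family and all function names are hypothetical, and hash collisions add a positive bias that the sketch does not correct.

```python
import math
import random
import statistics

PRIME = 2_147_483_647  # modulus for the illustrative hash family

def make_hash(rng, m):
    a, b = rng.randrange(1, PRIME), rng.randrange(PRIME)
    return lambda v: ((a * v + b) % PRIME) % m

def grr_perturb(value, m, eps, rng):
    p = math.exp(eps) / (math.exp(eps) + m - 1)
    if rng.random() < p:
        return value
    other = rng.randrange(m - 1)
    return other if other < value else other + 1

def collect(values, k, m, eps, rng):
    """Assign each user to one of k groups and record a perturbed hash bucket."""
    hashes = [make_hash(rng, m) for _ in range(k)]
    counts = [[0] * m for _ in range(k)]
    sizes = [0] * k
    for v in values:
        j = rng.randrange(k)
        counts[j][grr_perturb(hashes[j](v), m, eps, rng)] += 1
        sizes[j] += 1
    return hashes, counts, sizes

def query(v, hashes, counts, sizes, m, eps, n):
    """Median over groups of the debiased, rescaled bucket estimate for v."""
    p = math.exp(eps) / (math.exp(eps) + m - 1)
    q = 1.0 / (math.exp(eps) + m - 1)
    per_group = []
    for h, row, size in zip(hashes, counts, sizes):  # assumes no group is empty
        frac = (row[h(v)] / size - q) / (p - q)
        per_group.append(frac * n)  # scale the group fraction to the population
    return statistics.median(per_group)

rng = random.Random(3)
values = [0] * 12000 + [rng.randrange(1, 1000) for _ in range(18000)]
n, k, m, eps = len(values), 5, 32, 3.0
hashes, counts, sizes = collect(values, k, m, eps, rng)
estimate = query(0, hashes, counts, sizes, m, eps, n)
print(round(estimate))  # the true count of value 0 is 12000
```

Taking the median rather than the mean limits the influence of a group whose hash function happens to place many other values in the queried bucket.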

3. Accuracy Analysis and Lower Bounds

Formally, the optimal per-query error (with failure probability $\beta$) under $\varepsilon$-LDP is

$|\hat{f}_U[v] - f_U[v]| = O\left(\frac{1}{\varepsilon}\sqrt{n \ln(1/\beta)}\right)$

(Wu et al., 2021). This matches the information-theoretic lower bound $\Omega\left((1/\varepsilon)\sqrt{n\ln(1/\beta)}\right)$ (Lopuhaä-Zwakenberg et al., 2019).
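
To get a feel for the magnitude of this bound, the expression can be evaluated directly. The helper below is a hypothetical convenience function, and the hidden constant of the $O(\cdot)$ notation is set to 1 as an assumption, so the numbers indicate scaling rather than exact guarantees.

```python
import math

def ldp_error_bound(n, eps, beta, c=1.0):
    """Evaluate c * (1/eps) * sqrt(n * ln(1/beta)); c stands in for the hidden constant."""
    return c / eps * math.sqrt(n * math.log(1.0 / beta))

bound = ldp_error_bound(n=1_000_000, eps=1.0, beta=0.05)
bound_4n = ldp_error_bound(n=4_000_000, eps=1.0, beta=0.05)
print(round(bound), round(bound_4n))
```

With $n = 10^6$, $\varepsilon = 1$, and $\beta = 0.05$ the expression is roughly 1731 counts, or about 0.17% of the population. Quadrupling $n$ doubles the absolute error, so the relative error shrinks as $1/\sqrt{n}$.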

For MSE, canonical mechanisms such as HR (Hadamard Response), OLH, and PGR achieve

$\mathrm{MSE} = O\left(\frac{d}{n\varepsilon^2}\right)$

per dimension for $d$-way marginals (Feldman et al., 2022; Arcolezi, 2025). Subset Selection and its efficient variants (such as MSS and PGR) offer essentially optimal MSE but trade off decoding time against user communication.

Mechanism comparison table:

Mechanism	Per-user Communication	Server Time	Worst-case Error / MSE
GRR/RR	$\log_2 d$	$O(n+d)$	$O\left(\frac{d}{n\varepsilon^2}\right)$
OUE/RAPPOR	$d$	$O(nd)$	$O\left(\frac{d}{n\varepsilon^2}\right)$
OLH	$\log_2 d$	$O(n+d)$	$O\left(\frac{d}{n\varepsilon^2}\right)$
HR/HRR	$\log_2 d$	$O(n + d\log d)$	$O\left(\frac{d}{n\varepsilon^2}\right)$
PGR/MSS	$\log_2 d$	$O(n + d \log d)$	$O\left(\frac{d}{n\varepsilon^2}\right)$
HadaOracle	$1$	$\tilde{O}(n)$	$O\left(\frac{1}{\varepsilon}\sqrt{n\ln(1/\beta)}\right)$
Shuffle-HashMix	$1 + o(1)$ messages	$O(n)$	$\omega(1)\cdot O(\log n)$

4. Post-processing and Consistency Correction

Imposing consistency constraints on the histogram estimator (estimated frequencies must be non-negative and sum to one) substantially reduces error. Simple $O(d)$ projections (such as the $L_2$ simplex projection "Norm-Sub") or maximum-likelihood post-processing under Gaussian assumptions typically yield reductions in MSE of $10\times$ to $100\times$ on real and synthetic data, especially on full-domain and subset queries (Wang et al., 2019). For heavy-hitter (top-$k$) queries, minimal post-processing (additive normalization or none) may be optimal due to potential bias-shrinkage effects.
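
A Norm-Sub style projection can be sketched in a few lines. The version below is an illustrative implementation under stated assumptions: it repeatedly clamps negative entries to zero and shifts the remaining entries by a common amount until they sum to the target total. The function name `norm_sub` is a label chosen here; each pass costs $O(d)$, and usually only a few passes are needed.

```python
def norm_sub(estimates, total=1.0):
    """Project estimates onto {x : x >= 0, sum(x) = total} by shift-and-clamp."""
    est = list(estimates)
    active = set(range(len(est)))  # coordinates still allowed to be positive
    while True:
        # Common shift that makes the active coordinates sum to the target
        delta = (total - sum(est[i] for i in active)) / len(active)
        negatives = {i for i in active if est[i] + delta < 0}
        if not negatives:
            return [est[i] + delta if i in active else 0.0
                    for i in range(len(est))]
        active -= negatives  # these coordinates are fixed at zero

raw = [0.6, 0.5, -0.1, 0.2, -0.2]  # noisy LDP estimates, some negative
consistent = norm_sub(raw)
print([round(x, 3) for x in consistent])
```

For this input the two negative entries are set to zero and the remaining three are each reduced by 0.1, giving approximately 0.5, 0.4, 0, 0.1, 0. The largest entry can never be clamped, so the loop always terminates with at least one active coordinate.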

5. Advanced Models and Extensions

Flexible Local Differential Privacy

Relaxing the requirement of indistinguishability for all domain elements (parametrized by $\eta$) yields $(\varepsilon,\eta)$-FLDP, enabling frequency oracles with lower variance, particularly for small $\varepsilon$ (Zhao et al., 2022). The Flexible Hadamard Response (FHR) mechanism operates efficiently under FLDP, with communication $O(\log d)$ and server cost $O(n + d\log d)$.

Multidimensional and Sparse Data

For multidimensional data, strategies such as "Random Sampling plus Fake Data" (RS+FD) enable the collection of multiple attribute marginals while maintaining $\varepsilon$-LDP. RS+FD provides a privacy-amplified estimator: each user randomly selects and perturbs one attribute and fills the other attributes with fake data. The resulting MSE nearly matches that of the best single-attribute sampling (Smp) approach, while the server cannot tell which attribute was actually sampled (Arcolezi et al., 2021). For high-dimensional sparse cubes, private publication based on direct summarization, threshold sampling, and dyadic decomposition allows point and range queries to be answered efficiently under $\varepsilon$-DP with compact $O(n)$-size summaries (Cormode et al., 2011).

6. Streaming and Sliding Window Frequency Oracles

Streaming scenarios, such as frequency estimation over sliding windows, require incremental algorithms. DPSW-Sketch applies the Count-Min Sketch with added Gaussian noise (zCDP mechanisms), extended via smooth-histogram decomposition, to track counts over the most recent $w$ items with $(\varepsilon,\delta)$-DP. Space and update time scale as $O(\sqrt{w})$, with provable error bounds comparable to non-private Count-Min Sketches (Wang et al., 2024).
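
The basic building block, a Count-Min Sketch whose counters receive Gaussian noise, can be sketched as below. This is not DPSW-Sketch itself: the sliding-window and smooth-histogram machinery is omitted, the hash family and function names are hypothetical, and the noise scale assumes the standard Gaussian mechanism under $\rho$-zCDP, where an $L_2$ sensitivity $\Delta$ calls for variance $\Delta^2/(2\rho)$.

```python
import math
import random

PRIME = 2_147_483_647  # modulus for the illustrative hash family

def build_noisy_sketch(stream, width, depth, rho, rng):
    """Count-Min Sketch over integer items with Gaussian noise on every counter."""
    hashes = [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(depth)]
    table = [[0.0] * width for _ in range(depth)]
    for item in stream:
        for row, (a, b) in enumerate(hashes):
            table[row][((a * item + b) % PRIME) % width] += 1.0
    # One item changes one counter per row, so the L2 sensitivity is sqrt(depth).
    sigma = math.sqrt(depth / (2.0 * rho))
    for row in table:
        for col in range(width):
            # Illustration only: real deployments need a secure noise sampler.
            row[col] += rng.gauss(0.0, sigma)
    return hashes, table

def query(item, hashes, table):
    """Standard Count-Min query: minimum over the item's counter in each row."""
    width = len(table[0])
    return min(table[row][((a * item + b) % PRIME) % width]
               for row, (a, b) in enumerate(hashes))

rng = random.Random(11)
stream = [7] * 5000 + [rng.randrange(1000) for _ in range(5000)]
hashes, table = build_noisy_sketch(stream, width=256, depth=4, rho=0.5, rng=rng)
estimate = query(7, hashes, table)
true_count = stream.count(7)
print(round(estimate), true_count)
```

With noisy counters the minimum is no longer a guaranteed overestimate, so other aggregation rules (such as the median across rows) are a reasonable alternative depending on the noise level.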

7. Practical Considerations and Implementation Notes

Parameter tuning is critical: the number of hash functions, the sketch width and depth, and the noise scales depend on $n$, $d$, $\varepsilon$, $\beta$, the target error, and model specifics. For HadaOracle, $k=C\ln(1/\beta)$ partitions and $m=\Theta((1/\varepsilon)\sqrt{n})$ suffice, with $C\approx 8$ a practical choice (Wu et al., 2021). For modular and geometry-based methods, careful selection of the modulus and field parameters achieves $O(k\log k)$ server runtime while keeping communication low (Arcolezi, 2025; Feldman et al., 2022).
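
The HadaOracle parameter choices quoted above translate into a short helper. This is a hypothetical convenience function: rounding up to integers and using a constant of 1 inside the $\Theta(\cdot)$ for $m$ are assumptions made here, not values taken from the cited paper.

```python
import math

def hadaoracle_params(n, eps, beta, c=8.0):
    """Return (k, m) with k = ceil(C ln(1/beta)) and m = ceil(sqrt(n) / eps)."""
    k = math.ceil(c * math.log(1.0 / beta))
    m = math.ceil(math.sqrt(n) / eps)  # Theta-constant assumed to be 1
    return k, m

k, m = hadaoracle_params(n=1_000_000, eps=1.0, beta=0.05)
print(k, m)
```

For $n = 10^6$, $\varepsilon = 1$, and $\beta = 0.05$ this gives $k = 24$ and $m = 1000$ under the stated assumptions.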

Post-processing via consistency projection is recommended for full-domain or subset queries, while careful parameterization of the privacy-utility trade-off is necessary to adapt to application requirements. Empirical results consistently demonstrate that advanced mechanisms (HadaOracle, PGR, MSS) match or surpass naive LDP estimators while offering significant efficiency or robustness advantages.

References

"Asymptotically Optimal Locally Private Heavy Hitters via Parameterized Sketches" (Wu et al., 2021)
"Improving Frequency Estimation under Local Differential Privacy" (Lopuhaä-Zwakenberg et al., 2019)
"Private Frequency Estimation Via Residue Number Systems" (Arcolezi, 2025)
"Private Frequency Estimation via Projective Geometry" (Feldman et al., 2022)
"FLDP: Flexible strategy for local differential privacy" (Zhao et al., 2022)
"Locally Differentially Private Frequency Estimation via Joint Randomized Response" (Zheng et al., 2025)
"Random Sampling Plus Fake Data: Multidimensional Frequency Estimates With Local Differential Privacy" (Arcolezi et al., 2021)
"Differentially Private Publication of Sparse Data" (Cormode et al., 2011)
"DPSW-Sketch: A Differentially Private Sketch Framework for Frequency Estimation over Sliding Windows" (Wang et al., 2024)
"Locally Differentially Private Frequency Estimation with Consistency" (Wang et al., 2019)
"Frequency Estimation in the Shuffle Model with Almost a Single Message" (Luo et al., 2021)