
Differentially Private Frequency Oracle

Updated 26 December 2025
  • The paper introduces a frequency oracle that estimates histogram counts while providing rigorous ε-LDP guarantees across various mechanisms like RR, HR, and OLH.
  • It leverages canonical and advanced techniques, including randomized response and Hadamard transforms, to achieve optimal error bounds and near-optimal MSE.
  • Practical implementations benefit from consistency post-processing and parameter tuning, with extensions to FLDP and streaming data to balance utility and privacy.

A differentially private frequency oracle is an algorithmic framework that allows estimation of the frequency (histogram) of discrete values in a population while providing rigorous differential privacy guarantees for each user's data. In this context, a frequency oracle aims to output, for any value $v$ in the input domain, an estimate $\hat{f}_U[v]$ of $f_U[v] = |\{u : v^{(u)} = v\}|$ such that, with high probability (over the mechanism's internal randomness), the deviation $|\hat{f}_U[v] - f_U[v]|$ is appropriately bounded. The main setting is either the local differential privacy (LDP) model, where data is privatized on the user's device before transmission, or the centralized and shuffle models.

1. Problem Definition and Formal Guarantees

A frequency oracle under $\varepsilon$-local differential privacy (LDP) operates on a population of users $U = \{1, \dots, n\}$, each holding a private value $v^{(u)} \in D$ from a domain $D$ with $|D| = d$. An LDP mechanism $A: D \to Y$ must satisfy, for all $v, v' \in D$ and measurable $S \subseteq Y$,

$$\Pr[A(v) \in S] \leq e^{\varepsilon} \Pr[A(v') \in S].$$
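The inequality can be checked mechanically for any mechanism with a finite output set, since it then suffices to compare the probabilities of single outputs. The sketch below does this for generalized randomized response (GRR, described in Section 2); the function names are illustrative and not taken from any library.

```python
import math

def grr_probabilities(d, eps):
    """Return (p, q) for generalized randomized response over d values."""
    p = math.exp(eps) / (math.exp(eps) + d - 1)  # probability of reporting the true value
    q = 1.0 / (math.exp(eps) + d - 1)            # probability of reporting one specific other value
    return p, q

def max_likelihood_ratio(d, eps):
    """Largest ratio Pr[A(v) = y] / Pr[A(w) = y] over all inputs v, w and outputs y."""
    p, q = grr_probabilities(d, eps)
    channel = [[p if y == v else q for y in range(d)] for v in range(d)]
    return max(channel[v][y] / channel[w][y]
               for v in range(d) for w in range(d) for y in range(d))

eps = 1.0
ratio = max_likelihood_ratio(d=5, eps=eps)
print(ratio <= math.exp(eps) + 1e-12)  # the LDP bound holds, with equality up to rounding
```

For GRR the worst-case ratio is exactly $p/q = e^{\varepsilon}$, so the mechanism spends the whole privacy budget.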

A frequency oracle must provide, for any $v$, an estimate $\hat{f}_U[v]$ such that for error threshold $\lambda$ and failure probability $\beta$,

$$\Pr[|\hat{f}_U[v] - f_U[v]| \geq \lambda] \leq \beta,$$

with $\lambda$ as small as possible given $n, d, \varepsilon, \beta$ (Wu et al., 2021).

Key information-theoretic lower bounds on the achievable mean squared error (MSE) are $\Omega\left((e^{\varepsilon}+1)^2/[n (e^{\varepsilon}-1)^2]\right)$ per coordinate for the LDP setting (Lopuhaä-Zwakenberg et al., 2019).
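As a quick check on how this bound scales with the privacy budget (a worked simplification, not taken from the cited paper), the $\varepsilon$-dependent factor can be rewritten with a hyperbolic cotangent:

$$\frac{(e^{\varepsilon}+1)^2}{(e^{\varepsilon}-1)^2} = \coth^2\!\left(\frac{\varepsilon}{2}\right) \approx \frac{4}{\varepsilon^2} \quad \text{as } \varepsilon \to 0, \qquad \frac{(e^{\varepsilon}+1)^2}{(e^{\varepsilon}-1)^2} \to 1 \quad \text{as } \varepsilon \to \infty.$$

In the high-privacy regime the lower bound therefore behaves like $\Omega(1/(n\varepsilon^2))$ per coordinate, and for large $\varepsilon$ it flattens to $\Omega(1/n)$.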

2. Core Algorithmic Techniques

Several algorithmic paradigms achieve these guarantees, using various randomization and estimation schemes.

Canonical LDP Mechanisms

  • Randomized Response (RR): For binary domains, perturbing each bit achieves the optimal MSE.
  • Generalized Randomized Response (GRR): Extends RR to $d > 2$.
  • Optimized Unary Encoding (OUE)/RAPPOR: Encodes data as one-hot vectors and applies bitwise perturbation.
  • Hadamard Response (HRR)/Hadamard Transform: Maps domain to Hadamard basis, randomizes, and reconstructs via linear inverses.
  • Optimized Local Hashing (OLH): Hashes values into a smaller domain and applies randomized response (Lopuhaä-Zwakenberg et al., 2019).
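To make the unary-encoding entry in the list above concrete, here is a minimal sketch of OUE. It assumes the usual OUE parameters, $p = 1/2$ for the true position and $q = 1/(e^{\varepsilon}+1)$ for every other position; the function names and the toy data are illustrative only.

```python
import math
import random

def oue_perturb(value, d, eps, rng):
    """Client side: one-hot encode `value` and randomize each bit independently."""
    p = 0.5                          # the true position stays 1 with probability 1/2
    q = 1.0 / (math.exp(eps) + 1.0)  # every other position becomes 1 with probability q
    return [1 if rng.random() < (p if i == value else q) else 0
            for i in range(d)]

def oue_estimate(reports, eps):
    """Server side: debias per-position counts into frequency estimates."""
    n, d = len(reports), len(reports[0])
    p, q = 0.5, 1.0 / (math.exp(eps) + 1.0)
    counts = [sum(r[i] for r in reports) for i in range(d)]
    return [(c / n - q) / (p - q) for c in counts]

rng = random.Random(0)
d, eps = 4, 1.0
true_values = [0] * 6000 + [1] * 3000 + [2] * 1000   # value 3 never occurs
reports = [oue_perturb(v, d, eps, rng) for v in true_values]
estimates = oue_estimate(reports, eps)
print([round(f, 3) for f in estimates])  # noisy estimates of [0.6, 0.3, 0.1, 0.0]
```

Each report costs $d$ bits, which is why the hashing and Hadamard variants are preferred for large domains.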

For all of these approaches, the unbiased frequency estimator takes the form

$$\tilde f_v = \frac{\frac{c_v}{n} - q}{p - q},$$

where $c_v$ is the number of reports that support value $v$, $p$ is the probability that a report supports the user's true value, and $q$ is the probability that it supports any given false value.
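The estimator can be exercised end to end with GRR, where $p = e^{\varepsilon}/(e^{\varepsilon}+d-1)$ and $q = 1/(e^{\varepsilon}+d-1)$. The following is a small simulation sketch; the helper names and the toy population are assumptions made for illustration.

```python
import math
import random
from collections import Counter

def grr_perturb(value, d, eps, rng):
    """Client side: keep the true value with probability p, else pick another value."""
    p = math.exp(eps) / (math.exp(eps) + d - 1)
    if rng.random() < p:
        return value
    other = rng.randrange(d - 1)          # uniform over the d-1 values different from `value`
    return other if other < value else other + 1

def grr_estimate(reports, d, eps):
    """Server side: apply (c_v / n - q) / (p - q) to every value in the domain."""
    n = len(reports)
    p = math.exp(eps) / (math.exp(eps) + d - 1)
    q = 1.0 / (math.exp(eps) + d - 1)
    counts = Counter(reports)
    return [(counts[v] / n - q) / (p - q) for v in range(d)]

rng = random.Random(0)
d, eps = 4, 1.0
true_values = [0] * 10000 + [1] * 6000 + [2] * 4000   # true frequencies 0.5, 0.3, 0.2, 0.0
reports = [grr_perturb(v, d, eps, rng) for v in true_values]
estimates = grr_estimate(reports, d, eps)
print([round(f, 3) for f in estimates])
```

Individual estimates can be negative, but for GRR they always sum to one because $1 - dq = p - q$.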

Sketch-based and Advanced Methods

  • HadaOracle: A parameterized sketching scheme using a Count-Median structure, partitioning users into $k$ groups and hashing their values, with each group processed via the base $\varepsilon$-LDP frequency oracle (HRR). The oracle outputs the median estimate from the $k$ sketches, with error $O((1/\varepsilon)\sqrt{n\ln(1/\beta)})$, which matches the lower bound (Wu et al., 2021).
  • Subset Selection (SS), Projective Geometry Response (PGR), and Modular Subset Selection (MSS): Advanced compression and coding approaches, using subset selection or residue number systems for composite reporting; these offer near-optimal trade-offs among utility, bandwidth, and server runtime (Feldman et al., 2022, Arcolezi, 14 Nov 2025).
  • Flexible Hadamard Response (FHR): A mechanism operating under the relaxed $(\varepsilon,\eta)$-FLDP model, where outputs need only partial overlap for privacy, further improving utility-privacy trade-offs at low privacy budgets (Zhao et al., 2022).
  • Joint Randomized Response (JRR): Correlates perturbations in random user pairs to reduce variance, improving MSE by up to $100\times$ over RR in some regimes, while retaining $\varepsilon$-LDP (Zheng et al., 15 May 2025).
  • Shuffle Model Protocols: Leverage a shuffling layer to anonymize messages, allowing protocols such as the "hash-and-mix" approach to achieve error $\omega(1)\,O(\log n)$ with $1+o(1)$ messages per user, nearly matching central DP and substantially outperforming local-model (LDP) protocols (Luo et al., 2021).
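Several of the methods above build on Hadamard randomized response as a base oracle. The sketch below shows only that base step in one common formulation (each user samples a row index and sends a single randomized $\pm 1$ entry); it is not the HadaOracle sketching layer, and the function names are illustrative. It assumes $d$ is a power of two and uses a plain $O(d^2)$ inverse for readability where a fast Walsh-Hadamard transform would give $O(d\log d)$.

```python
import math
import random

def hadamard_entry(j, v):
    """Entry H[j][v] of the Sylvester Hadamard matrix: (-1)^popcount(j & v)."""
    return -1 if bin(j & v).count("1") % 2 else 1

def hrr_report(value, d, eps, rng):
    """Client side: sample a row index j and send one randomized +/-1 entry."""
    j = rng.randrange(d)                        # chosen independently of the value
    bit = hadamard_entry(j, value)
    keep = math.exp(eps) / (math.exp(eps) + 1.0)
    return j, (bit if rng.random() < keep else -bit)

def hrr_estimate(reports, d, eps):
    """Server side: debias per-row sums, then invert the transform to get counts."""
    scale = (math.exp(eps) + 1.0) / (math.exp(eps) - 1.0)   # equals 1 / (2 * keep - 1)
    coeff = [0.0] * d
    for j, bit in reports:
        coeff[j] += d * scale * bit    # unbiased for sum over v of count[v] * H[j][v]
    return [sum(hadamard_entry(j, v) * coeff[j] for j in range(d)) / d
            for v in range(d)]

rng = random.Random(1)
d, eps = 8, 2.0
true_counts = [10000, 6000, 4000, 0, 0, 0, 0, 0]
values = [v for v, c in enumerate(true_counts) for _ in range(c)]
reports = [hrr_report(v, d, eps, rng) for v in values]
print([round(x) for x in hrr_estimate(reports, d, eps)])  # noisy versions of true_counts
```

Each user contributes a term of magnitude $(e^{\varepsilon}+1)/(e^{\varepsilon}-1)$ to every estimate, so the standard deviation of a count grows like $\sqrt{n}/\varepsilon$ for small $\varepsilon$.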

3. Accuracy Analysis and Lower Bounds

Formally, the optimal per-query error (with failure probability $\beta$) under $\varepsilon$-LDP is

$$|\hat{f}_U[v] - f_U[v]| = O\left(\frac{1}{\varepsilon}\sqrt{n \ln(1/\beta)}\right)$$

(Wu et al., 2021). This matches the information-theoretic lower bound $\Omega\left((1/\varepsilon)\sqrt{n\ln(1/\beta)}\right)$ (Lopuhaä-Zwakenberg et al., 2019).
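The $\sqrt{n\ln(1/\beta)}/\varepsilon$ shape can be reproduced with a back-of-the-envelope argument. Assume an unbiased estimator that is a sum of $n$ independent per-user terms, each bounded in $[-c, c]$ with $c = (e^{\varepsilon}+1)/(e^{\varepsilon}-1)$, as is the case for one-bit randomized response debiasing. Hoeffding's inequality then gives $\lambda = c\sqrt{2n\ln(2/\beta)}$. This is an illustrative derivation under those assumptions, not the analysis from the cited papers.

```python
import math

def hoeffding_error_bound(n, eps, beta):
    """Error threshold lambda with Pr[|estimate - truth| >= lambda] <= beta (Hoeffding)."""
    c = (math.exp(eps) + 1.0) / (math.exp(eps) - 1.0)   # per-user term magnitude, about 2/eps for small eps
    return c * math.sqrt(2.0 * n * math.log(2.0 / beta))

for n in (10**4, 10**6):
    print(n, round(hoeffding_error_bound(n, eps=1.0, beta=0.05)))
```

Quadrupling $n$ doubles the absolute error, so the relative error on a count that scales with $n$ shrinks like $1/\sqrt{n}$.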

For MSE, canonical mechanisms such as HR (Hadamard Response), OLH, and PGR achieve

$$\mathrm{MSE} = O\left(\frac{d}{n\varepsilon^2}\right)$$

per dimension for $d$-way marginals (Feldman et al., 2022, Arcolezi, 14 Nov 2025). Subset Selection and its efficient variants (such as MSS, PGR) offer essentially optimal MSE but trade off decoding time and user communication.

Mechanism comparison table:

| Mechanism | Per-user Comm. | Server Time | Worst-case Error/MSE |
|---|---|---|---|
| GRR/RR | $\log_2 d$ | $O(n+d)$ | $O\left(\frac{d}{n\varepsilon^2}\right)$ |
| OUE/RAPPOR | $d$ | $O(nd)$ | $O\left(\frac{d}{n\varepsilon^2}\right)$ |
| OLH | $\log_2 d$ | $O(n+d)$ | $O\left(\frac{d}{n\varepsilon^2}\right)$ |
| HR/HRR | $\log_2 d$ | $O(n + d\log d)$ | $O\left(\frac{d}{n\varepsilon^2}\right)$ |
| PGR/MSS | $\log_2 d$ | $O(n + d\log d)$ | $O\left(\frac{d}{n\varepsilon^2}\right)$ |
| HadaOracle | $1$ | $\tilde{O}(n)$ | $O\left(\frac{1}{\varepsilon}\sqrt{n\ln(1/\beta)}\right)$ |
| Shuffle-HashMix | $1 + o(1)$ messages | $O(n)$ | $O(\log n)$ |
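The asymptotic column hides constants that matter when choosing between GRR and OUE/OLH in practice. The sketch below uses the approximate per-value count variances commonly quoted for these protocols in the LDP literature, $n(d-2+e^{\varepsilon})/(e^{\varepsilon}-1)^2$ for GRR and $4ne^{\varepsilon}/(e^{\varepsilon}-1)^2$ for OUE and OLH. These formulas are an assumption imported from outside the text above, and the function names are illustrative.

```python
import math

def grr_count_variance(n, d, eps):
    """Approximate variance of a GRR count estimate (grows linearly with d)."""
    return n * (d - 2 + math.exp(eps)) / (math.exp(eps) - 1) ** 2

def oue_count_variance(n, eps):
    """Approximate variance of an OUE or OLH count estimate (independent of d)."""
    return n * 4 * math.exp(eps) / (math.exp(eps) - 1) ** 2

def lower_variance_mechanism(d, eps):
    """GRR wins exactly when d - 2 + e^eps < 4 e^eps, i.e. d < 3 e^eps + 2."""
    return "GRR" if d < 3 * math.exp(eps) + 2 else "OUE/OLH"

for d in (4, 16, 1024):
    print(d, lower_variance_mechanism(d, eps=1.0))
```

Under these formulas GRR is preferable only for small domains, and the crossover point moves up as $\varepsilon$ grows.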

4. Post-processing and Consistency Correction

Imposing consistency constraints (probabilities non-negative and summing to one) on the histogram estimator substantially reduces error. Simple $O(d)$ projections (such as the $L_2$ simplex projection "Norm-Sub") or maximum-likelihood post-processing under Gaussian assumptions typically yield reductions in MSE of $10\times$ to $100\times$ on real and synthetic data, especially on full-domain and subset queries (Wang et al., 2019). For heavy hitters (top-$k$) queries, minimal post-processing (additive normalization or none) may be optimal due to potential bias-shrinkage effects.
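One way to realize a Norm-Sub style projection is to shift all estimates by a common offset, clip negatives to zero, and repeat on the remaining entries until the result sums to one. The sketch below follows that reading; the function name is illustrative, the input is assumed to be non-empty, and the loop is written for clarity instead of strict linear time.

```python
def norm_sub(estimates, total=1.0):
    """Project estimates onto {x >= 0, sum(x) = total} by shifting and clipping."""
    active = set(range(len(estimates)))      # indices still allowed to be positive
    while True:
        delta = (total - sum(estimates[i] for i in active)) / len(active)
        negative = {i for i in active if estimates[i] + delta < 0}
        if not negative:
            break
        active -= negative                   # these entries are fixed at zero
    return [estimates[i] + delta if i in active else 0.0
            for i in range(len(estimates))]

raw = [0.6, 0.5, -0.1, 0.2]                  # raw estimates need not be a valid distribution
print([round(x, 3) for x in norm_sub(raw)])  # → [0.5, 0.4, 0.0, 0.1]
```

In the example the first pass removes the negative entry, and the second pass subtracts 0.1 from each of the three remaining values.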

5. Advanced Models and Extensions

Flexible Local Differential Privacy

Relaxing the requirement of indistinguishability for all domain elements (parametrized by $\eta$) yields $(\varepsilon,\eta)$-FLDP, enabling frequency oracles with lower variance, particularly for small $\varepsilon$ (Zhao et al., 2022). The Flexible Hadamard Response (FHR) mechanism operates efficiently under FLDP, with communication $O(\log d)$ and server cost $O(n + d\log d)$.

Multidimensional and Sparse Data

For multidimensional data, strategies such as "Random Sampling plus Fake Data" (RS+FD) enable the collection of multiple attribute marginals while maintaining $\varepsilon$-LDP. RS+FD provides a privacy-amplified estimator by randomly selecting and perturbing one attribute per user and filling the other attributes with fake data, resulting in MSE nearly matching the best single-attribute sampling (Smp) approach while offering indistinguishability protection across attributes (Arcolezi et al., 2021). For high-dimensional sparse cubes, private publication based on direct summarization, threshold sampling, and dyadic decomposition allows point and range queries to be answered efficiently under $\varepsilon$-DP with compact $O(n)$-size summaries (Cormode et al., 2011).

6. Streaming and Sliding Window Frequency Oracles

Streaming scenarios, such as frequency estimation over sliding windows, require incremental algorithms. DPSW-Sketch applies the Count-Min Sketch with added Gaussian noise (zCDP mechanisms), extended via smooth-histogram decomposition, to track counts over the most recent $w$ items with $(\varepsilon,\delta)$-DP. Space and update time scale as $O(\sqrt{w})$ with provable error bounds comparable to non-private Count-Min Sketches (Wang et al., 2024).
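The building block named above, a Count-Min Sketch with Gaussian noise, can be sketched in a few lines. This is not DPSW-Sketch: there is no sliding window or smooth-histogram layer, and the class name, hashing scheme, and single-release interface are assumptions for illustration. The noise scale uses the standard Gaussian-mechanism calibration for $\rho$-zCDP, $\sigma = \Delta_2/\sqrt{2\rho}$, with $\Delta_2 = \sqrt{\text{depth}}$ when neighboring streams differ by one item.

```python
import hashlib
import math
import random

class NoisyCountMin:
    def __init__(self, width, depth, rho, rng):
        self.width, self.depth, self.rng = width, depth, rng
        # One item changes `depth` counters by 1 each, so the L2 sensitivity is sqrt(depth).
        # rho=None disables noise, giving a plain (non-private) Count-Min Sketch.
        self.sigma = 0.0 if rho is None else math.sqrt(depth / (2.0 * rho))
        self.table = [[0.0] * width for _ in range(depth)]

    def _bucket(self, row, item):
        # Salting the hash with the row index yields one hash function per row.
        digest = hashlib.blake2b(repr(item).encode(), digest_size=8,
                                 salt=row.to_bytes(8, "little")).digest()
        return int.from_bytes(digest, "little") % self.width

    def update(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._bucket(row, item)] += count

    def release(self):
        """Return a noisy copy of the counters; only this copy should be queried."""
        return [[c + self.rng.gauss(0.0, self.sigma) for c in row]
                for row in self.table]

    def query(self, released, item):
        return min(released[row][self._bucket(row, item)]
                   for row in range(self.depth))

sketch = NoisyCountMin(width=256, depth=4, rho=0.5, rng=random.Random(0))
for item in ["a"] * 500 + ["b"] * 50 + ["c"] * 5:
    sketch.update(item)
released = sketch.release()
print(round(sketch.query(released, "a")))   # close to 500, up to collisions and noise
```

Releasing the table once and answering all queries from that copy keeps the privacy cost at a single Gaussian mechanism; adding fresh noise per query would not.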

7. Practical Considerations and Implementation Notes

Parameter tuning is critical; choices for the number of hash functions, sketch width/depth, and noise scales depend on $n, d, \varepsilon, \beta$, the target error, and model specifics. For HadaOracle, $k = C\ln(1/\beta)$ partitions and $m = \Theta((1/\varepsilon)\sqrt{n})$ suffice, with $C \approx 8$ a practical choice (Wu et al., 2021). For modular and geometry-based methods, careful modulus and field parameter selection achieves $O(k\log k)$ server runtime and maintains low communication (Arcolezi, 14 Nov 2025, Feldman et al., 2022).
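The HadaOracle settings quoted above translate into a small helper. The $\Theta(\cdot)$ for $m$ hides a constant that the text does not specify, so the sketch exposes it as a hypothetical `width_constant` parameter defaulting to 1; the function name and the rounding up to integers are likewise assumptions.

```python
import math

def hadaoracle_parameters(n, eps, beta, C=8.0, width_constant=1.0):
    """Return (k, m): k = C * ln(1/beta) partitions, m proportional to sqrt(n) / eps."""
    k = max(1, math.ceil(C * math.log(1.0 / beta)))
    m = max(1, math.ceil(width_constant * math.sqrt(n) / eps))
    return k, m

print(hadaoracle_parameters(n=1_000_000, eps=1.0, beta=0.01))  # → (37, 1000)
```

Halving $\varepsilon$ doubles $m$, while tightening $\beta$ only increases $k$ logarithmically.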

Post-processing via consistency projection is recommended for full-domain or subset queries, while careful parameterization of the privacy-utility trade-off is necessary to adapt to application requirements. Empirical results consistently demonstrate that advanced mechanisms (HadaOracle, PGR, MSS) match or surpass naive LDP estimators while offering significant efficiency or robustness advantages.

