Locally Private Histograms in All Privacy Regimes (2408.04888v2)

Published 9 Aug 2024 in cs.DS, cs.CR, and cs.DM

Abstract: Frequency estimation, a.k.a. histograms, is a workhorse of data analysis, and as such has been thoroughly studied under differential privacy. In particular, computing histograms in the local model of privacy has been the focus of a fruitful recent line of work, and various algorithms have been proposed, achieving the order-optimal $\ell_\infty$ error in the high-privacy (small $\varepsilon$) regime while balancing other considerations such as time- and communication-efficiency. However, to the best of our knowledge, the picture is much less clear when it comes to the medium- or low-privacy regime (large $\varepsilon$), despite its increased relevance in practice. In this paper, we investigate locally private histograms, and the very related distribution learning task, in this medium-to-low privacy regime, and establish near-tight (and somewhat unexpected) bounds on the $\ell_\infty$ error achievable. As a direct corollary of our results, we obtain a protocol for histograms in the shuffle model of differential privacy, with accuracy matching previous algorithms but significantly better message and communication complexity. Our theoretical findings emerge from a novel analysis, which appears to improve bounds across the board for the locally private histogram problem. We back our theoretical findings by an empirical comparison of existing algorithms in all privacy regimes, to assess their typical performance and behaviour beyond the worst-case setting.

Summary

The paper establishes new near-tight bounds on the ℓ∞ error of frequency estimation under local differential privacy, focusing on the medium- to low-privacy regimes (large ε).
It applies a refined analysis to existing protocols such as RAPPOR and Projective Geometry Response (PGR), showing that they achieve near-optimal worst-case ℓ∞ error without major algorithmic changes.
A lower bound shows that these upper bounds are tight up to logarithmic factors, and an empirical comparison of existing protocols across privacy regimes complements the theory.

Overview of the Paper "Locally Private Histograms in All Privacy Regimes"

Frequency estimation, also known as histogram estimation, is a core task in data analysis and has been studied extensively under differential privacy. The paper by Canonne and Gentle investigates locally private histograms specifically in the medium- to low-privacy regimes (large ε) and provides near-tight bounds on the achievable error. Unlike the well-studied high-privacy (small ε) regime, these regimes have received little attention, despite their practical significance.

The Problem and Existing Gaps

The problem is to estimate a frequency vector, or histogram, while guaranteeing local differential privacy (LDP) to each data contributor. The authors emphasize that most prior research has concentrated on the small ε regime, leaving the practical settings where larger ε values are often used largely unexplored. This paper aims to fill that gap by characterizing the optimal error rates in the medium- and low-privacy regimes.
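
For reference, the standard definition of ε-local differential privacy can be stated as follows. The notation is introduced here for illustration and is not taken from the paper: $R$ denotes a user's randomizer, $\mathcal{X}$ the set of possible inputs, and $\mathcal{Y}$ the set of possible reports.

$\Pr[R(x) = y] \leq e^{\varepsilon} \cdot \Pr[R(x') = y] \quad \text{for all } x, x' \in \mathcal{X} \text{ and all } y \in \mathcal{Y}$

A small ε forces the report distributions of any two inputs to be nearly indistinguishable (high privacy), whereas a large ε lets a report carry much more information about the input (low privacy).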

Methodology

Worst-Case Error and Privacy Parameter ε

The primary focus is on the standard worst-case estimation error, or ℓ∞ error, which is directly relevant to problems such as identifying heavy hitters. Writing $q$ for the true histogram over $k$ bins and $\hat{q}$ for its estimate, the expected estimation error is:

$\mathbb{E}\left[\|\hat{q} - q\|_\infty\right] = \mathbb{E}\left[\max_{1 \leq i \leq k} |\hat{q}_i - q_i|\right]$
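
As a concrete illustration of the quantity inside the expectation, the following minimal sketch computes the ℓ∞ error for one realization of an estimate. The function names and the example values are hypothetical and chosen only for illustration.

```python
def empirical_histogram(samples, k):
    """Normalised frequencies of samples drawn from {0, ..., k-1}."""
    counts = [0] * k
    for x in samples:
        counts[x] += 1
    n = len(samples)
    return [c / n for c in counts]


def linf_error(estimate, truth):
    """Return max_i |estimate[i] - truth[i]| for two equal-length histograms."""
    if len(estimate) != len(truth):
        raise ValueError("histograms must have the same number of bins")
    return max(abs(e - t) for e, t in zip(estimate, truth))


# Four users holding symbols from {0, 1, 2}.
q = empirical_histogram([0, 0, 1, 2], k=3)   # [0.5, 0.25, 0.25]
q_hat = [0.4, 0.35, 0.25]                    # hypothetical noisy estimate
err = linf_error(q_hat, q)                   # largest per-bin deviation, about 0.1
```

The expectation in the formula would be approximated by averaging `linf_error` over many independent runs of a randomized protocol.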

The investigation spans scenarios with varying constraints, including private-coin and public-coin protocols, and compares non-interactive versus interactive protocols, acknowledging the practical deployment challenges associated with each.

New Theoretical Contributions

The paper extends existing bounds in frequency estimation by introducing a novel analysis technique. This enables a tighter characterization of the error bound across different parameter regimes without necessitating significant modifications in existing algorithms.

Key Results:

General Transformation and Baseline: A theoretical baseline is established for converting any optimal high-privacy LDP protocol to perform adequately in the low-privacy regime by increasing communication overhead. Specifically, the error rate for any symmetric LDP protocol can be given as:

$E(n, k, \epsilon) = O\left(\frac{\log k}{n \min(\epsilon', \epsilon)}\right) \rightarrow O\left(\frac{\log k}{n \min(T, \epsilon, \epsilon / T)^2}\right)$

Refined Analysis for RAPPOR: The well-known RAPPOR protocol, previously simplified for analysis, is revisited to demonstrate that it can achieve the improved bound without fundamental changes. The expected ℓ∞ error is expressed as:

$O\left(\frac{\log k}{n \epsilon}\right)$

This analysis relies on a more intricate study of the sub-Gaussian and sub-gamma behavior of sums of Bernoulli random variables.
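
To make the object of this analysis concrete, here is a minimal sketch of a basic symmetric, one-hot variant of RAPPOR. It assumes the textbook version in which each bit of the one-hot encoding is flipped independently with probability 1/(e^(ε/2) + 1); this may differ in detail from the exact variant analyzed in the paper, and all function names are hypothetical. Each debiased coordinate is an average of independent Bernoulli variables, which is the kind of sum the concentration argument has to control.

```python
import math
import random


def rappor_randomize(x, k, eps, rng):
    """One-hot encode x in {0..k-1}, then flip each bit independently."""
    flip = 1.0 / (math.exp(eps / 2.0) + 1.0)   # per-bit flip probability
    return [(1 if i == x else 0) ^ (1 if rng.random() < flip else 0)
            for i in range(k)]


def rappor_estimate(reports, k, eps):
    """Debias the per-coordinate means of the noisy bit vectors."""
    flip = 1.0 / (math.exp(eps / 2.0) + 1.0)
    n = len(reports)
    means = [sum(r[i] for r in reports) / n for i in range(k)]
    # E[mean_i] = flip + q_i * (1 - 2 * flip), so invert that affine map.
    return [(m - flip) / (1.0 - 2.0 * flip) for m in means]


rng = random.Random(0)                       # fixed seed for reproducibility
k, eps = 8, 2.0
data = [rng.randrange(k) for _ in range(5000)]
reports = [rappor_randomize(x, k, eps, rng) for x in data]
q_hat = rappor_estimate(reports, k, eps)
```

Two one-hot vectors differ in exactly two bits, so the likelihood ratio of any report is at most (e^(ε/2))^2 = e^ε, which is why this flip probability yields ε-LDP.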

Projective Geometry Response (PGR): The PGR protocol, a recent protocol with optimal ℓ2 error, is analyzed for ℓ∞ error. Remarkably, it also achieves near-optimal bounds:

$O\left(\max\left(\frac{\log k}{n \epsilon}, \frac{\log k \log k}{n \epsilon}\log n \right)\right)$

Lower Bounds

To establish optimality, an information-theoretic lower bound is derived for any LDP protocol concerning the worst-case ℓ∞ error. This lower bound is given by:

$\Omega\left(\max\left(\frac{\log k}{n \epsilon^2}, \frac{\log k \log k}{n \epsilon}\right)\right)$

This bound is shown to match the upper bounds up to logarithmic factors, indicating that the analyzed protocols are near-optimal.

Empirical Evaluation

Substantial empirical work complements the theoretical findings, comparing various existing protocols against the newly derived bounds. Key observations include:

Subset Selection Protocols: In practice, these nearly match the theoretical lower bounds, suggesting that their upper bound analysis should be revisited.
Distribution Dependence: The paper notes significant variability in protocol accuracy depending on the input distribution, an aspect crucial for understanding practical performance.
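
A small experiment in the spirit of this comparison can be sketched as follows. It uses k-ary randomized response purely as a simple stand-in; that choice is an assumption for illustration, not a claim about which protocols the paper evaluates, and the function names and parameter values are hypothetical. The sketch measures the average ℓ∞ error on two different inputs so that any dependence on the input distribution can be observed directly.

```python
import math
import random


def krr_randomize(x, k, eps, rng):
    """Keep x with probability e^eps / (e^eps + k - 1), else report another symbol."""
    keep = math.exp(eps) / (math.exp(eps) + k - 1)
    if rng.random() < keep:
        return x
    other = rng.randrange(k - 1)             # uniform over the k-1 symbols != x
    return other if other < x else other + 1


def krr_estimate(reports, k, eps):
    """Debias the observed report frequencies."""
    keep = math.exp(eps) / (math.exp(eps) + k - 1)
    lie = 1.0 / (math.exp(eps) + k - 1)
    n = len(reports)
    counts = [0] * k
    for y in reports:
        counts[y] += 1
    # E[count_i / n] = lie + q_i * (keep - lie), so invert that affine map.
    return [(c / n - lie) / (keep - lie) for c in counts]


def mean_linf_error(data, k, eps, trials, rng):
    """Average max_i |q_hat_i - q_i| over independent runs on fixed data."""
    n = len(data)
    truth = [data.count(i) / n for i in range(k)]
    total = 0.0
    for _ in range(trials):
        reports = [krr_randomize(x, k, eps, rng) for x in data]
        q_hat = krr_estimate(reports, k, eps)
        total += max(abs(a - b) for a, b in zip(q_hat, truth))
    return total / trials


rng = random.Random(1)                       # fixed seed for reproducibility
k, n, eps = 16, 2000, 3.0
uniform_data = [i % k for i in range(n)]     # every symbol equally frequent
point_mass_data = [0] * n                    # all users hold symbol 0
err_uniform = mean_linf_error(uniform_data, k, eps, trials=20, rng=rng)
err_point = mean_linf_error(point_mass_data, k, eps, trials=20, rng=rng)
```

Sweeping `eps` and swapping in other randomizers would give a rough picture of typical behavior beyond the worst case, which is the purpose the paper states for its own experiments.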

Implications and Future Directions

The research presented by Canonne and Gentle has significant implications for differential privacy, especially in real-world applications where the privacy parameter is often larger than the small values assumed in most theoretical analyses. Future work could develop general techniques for analyzing asymmetric and interactive protocols. Further study of the shuffle model, for which the paper already obtains a histogram protocol as a corollary, could also reveal additional nuances in privacy-preserving data analysis.

In essence, this paper advances the understanding of locally private histograms, particularly under medium- and low-privacy regimes, providing robust theoretical and practical insights which pave the way for further advancements in privacy-preserving technologies.
