A Learning Theory Approach to Non-Interactive Database Privacy (1109.2229v1)

Published 10 Sep 2011 in cs.DS, cs.CR, and cs.LG

Abstract: In this paper we demonstrate that, ignoring computational constraints, it is possible to privately release synthetic databases that are useful for large classes of queries -- much larger in size than the database itself. Specifically, we give a mechanism that privately releases synthetic data for a class of queries over a discrete domain with error that grows as a function of the size of the smallest net approximately representing the answers to that class of queries. We show that this in particular implies a mechanism for counting queries that gives error guarantees that grow only with the VC-dimension of the class of queries, which itself grows only logarithmically with the size of the query class. We also show that it is not possible to privately release even simple classes of queries (such as intervals and their generalizations) over continuous domains. Despite this, we give a privacy-preserving polynomial time algorithm that releases information useful for all halfspace queries, given a slight relaxation of the utility guarantee. This algorithm does not release synthetic data, but instead another data structure capable of representing an answer for each query. We also give an efficient algorithm for releasing synthetic data for the class of interval queries and axis-aligned rectangles of constant dimension. Finally, inspired by learning theory, we introduce a new notion of data privacy, which we call distributional privacy, and show that it is strictly stronger than the prevailing privacy notion, differential privacy.

Citations (550)

Summary

The paper proposes a mechanism for privately releasing synthetic databases that answer counting queries with error that grows only with the VC-dimension of the query class.
The paper proves that even simple query classes, such as intervals, cannot be privately released over continuous domains, and separately introduces a stronger privacy notion termed distributional privacy.
The paper presents computationally efficient algorithms for halfspace, interval, and axis-aligned rectangle queries; the halfspace algorithm relies on a slightly relaxed utility guarantee.

Essay: A Learning Theory Approach to Non-Interactive Database Privacy

This paper, authored by Avrim Blum, Katrina Ligett, and Aaron Roth, explores an intersection between learning theory and database privacy. The focus is on designing mechanisms that allow the non-interactive release of private synthetic data for extensive classes of queries, while providing robust privacy guarantees.

Core Contributions

Synthetic Data Release for Discrete Domains: The authors give a mechanism that, ignoring computational constraints, privately releases a synthetic database useful for a large class of counting queries over a discrete domain. The error grows only with the VC-dimension of the query class, which is itself at most logarithmic in the size of the class. As a result, a single release can be useful for a query class much larger than the database itself.
Impossibility Results on Continuous Domains: They prove that it is impossible to privately release even simple query classes, such as intervals and their generalizations, over continuous domains under the strict utility definition. This shows an inherent limitation of continuous data models and motivates relaxing the utility guarantee.
Computationally Efficient Algorithms: Despite the impossibility results, the authors design a polynomial-time, privacy-preserving algorithm that releases information useful for all halfspace queries, given a slight relaxation of the utility guarantee. This algorithm does not release synthetic data; it outputs a different data structure capable of representing an answer for each query. They also give an efficient algorithm that releases synthetic data for interval queries and axis-aligned rectangles of constant dimension.
Strengthened Privacy Notion - Distributional Privacy: Inspired by learning theory, the authors introduce a new privacy notion, "distributional privacy," and show that it is strictly stronger than differential privacy. Informally, a mechanism should reveal properties of the underlying distribution rather than specifics of individual records, so its outputs should be nearly indistinguishable across databases drawn from the same distribution.
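The synthetic data release in the first contribution can be illustrated with a small sketch. The summary above does not spell out the construction, so the following is an assumption: it uses the exponential mechanism to sample a small synthetic database whose answers to every counting query are close to those of the private database. All names (`counting_query`, `max_error`, `release_synthetic`) and the parameters `m` and `epsilon` are hypothetical, and the brute-force enumeration is exponential in `m`, which is consistent with the result being stated "ignoring computational constraints."

```python
import itertools
import math
import random


def counting_query(predicate, database):
    """Fraction of rows in `database` that satisfy `predicate`."""
    return sum(1 for row in database if predicate(row)) / len(database)


def max_error(queries, database, candidate):
    """Largest disagreement between the two databases over all queries."""
    return max(
        abs(counting_query(q, database) - counting_query(q, candidate))
        for q in queries
    )


def release_synthetic(database, domain, queries, m, epsilon, rng):
    """Sample a size-m synthetic database with the exponential mechanism.

    Assumed scoring rule: score(candidate) = -(max query error).
    Changing one private row moves any counting query by at most 1/n,
    so the score has sensitivity 1/n.
    """
    n = len(database)
    # Every multiset of m domain points is a candidate (exponential in m).
    candidates = list(itertools.combinations_with_replacement(domain, m))
    scores = [-max_error(queries, database, c) for c in candidates]
    # Weight is proportional to exp(epsilon * score / (2 * sensitivity)).
    # Subtracting the best score only rescales the weights; it avoids overflow.
    best = max(scores)
    weights = [math.exp(epsilon * n * (s - best) / 2.0) for s in scores]
    return list(rng.choices(candidates, weights=weights, k=1)[0])


# Illustrative usage: threshold queries "x <= t" over a domain of 8 points.
domain = list(range(8))
queries = [lambda x, t=t: x <= t for t in domain]
rng = random.Random(0)
private = [rng.choice(domain) for _ in range(200)]
synthetic = release_synthetic(private, domain, queries, m=4, epsilon=1.0, rng=rng)
```

The design choice to search over small candidate databases is what ties the error to the complexity of the query class: a few well-chosen rows can approximate the answers of a low-complexity class, so the candidate space stays far smaller than the space of all databases of size n.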

Implications and Future Directions

Theoretical Insights:

The paper ties database privacy to learning theory concepts such as VC-dimension, so that the complexity of a query class, rather than its raw size, governs the achievable accuracy. It also draws parallels with the statistical query model and, through distributional privacy, considers a guarantee stronger than differential privacy in the non-interactive setting.
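For reference, the standard definitions behind these claims can be written as follows. The notation is an assumption of this sketch and is not taken from the paper: $D$ is a database of $n$ rows over a domain $X$, $C$ is a finite class of predicates on $X$, and $M$ is a randomized mechanism.

```latex
% Counting query for a predicate \varphi \in C
Q_\varphi(D) = \frac{1}{n}\,\bigl|\{\, x \in D : \varphi(x) = 1 \,\}\bigr|

% \varepsilon-differential privacy: for all D, D' differing in one row
% and all sets S of outputs
\Pr[M(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[M(D') \in S]

% Shattering d points requires 2^d distinct predicates, so for finite C
\mathrm{VCdim}(C) \;\le\; \log_2 |C|
```

The last inequality is the sense in which error that grows with the VC-dimension grows at most logarithmically with the size of the query class.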

Practical Applications:

From a practical standpoint, these results could inform safer data release in settings such as healthcare data sharing, where sensitive personal information is abundant. However, the general mechanism is stated ignoring computational constraints, so practical use depends on efficient algorithms such as those given for halfspaces, intervals, and axis-aligned rectangles.

Future Exploration:

The gap between the general, computationally unconstrained mechanism and the efficient special-case algorithms calls for further work on efficient algorithms that scale with database size while preserving the privacy guarantees. In particular, bounds based on the VC-dimension suggest that more efficient mechanisms may be possible for specific classes such as conjunctions or parity queries.

Overall, the paper connects learning theory with practical concerns in data privacy and leaves open how far its guarantees can be achieved efficiently.
