- The paper introduces slicing, a novel data anonymization technique that partitions data vertically into columns of correlated attributes and horizontally into buckets, then permutes column values within each bucket to break the associations between columns.
- Slicing preserves significant attribute correlations for better data utility compared to generalization and offers stronger privacy protection, including resistance to membership disclosure, than bucketization.
- Evaluations show that slicing outperforms generalization in utility, performs comparably to or better than bucketization while adding membership protection, and is efficiently computable and suitable for high-dimensional data.
Slicing: A New Approach to Privacy-Preserving Data Publishing
The paper "Slicing: A New Approach to Privacy-Preserving Data Publishing" introduces a data anonymization technique called slicing, which addresses the limitations of earlier methods such as generalization and bucketization in privacy-preserving microdata publishing. Both prior techniques have drawbacks: generalization can lose a significant amount of information, particularly on high-dimensional data, while bucketization does not guard against membership disclosure and assumes a clear separation between quasi-identifiers and sensitive attributes, which many datasets lack.
Key Insights and Method
Slicing partitions data in two ways: vertically, by grouping correlated attributes into columns, and horizontally, by grouping tuples into buckets. Within each bucket, the values of each column are permuted independently of the other columns. This breaks the association between different columns while keeping the values of attributes in the same column together. As a result, slicing preserves more data utility than generalization and provides stronger privacy protection than bucketization.
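The two partitioning steps and the per-bucket permutation can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `slice_table`, the sample `ROWS` and `COLUMNS`, and the fixed-size chunking used for horizontal partitioning are all assumptions made for the example (the paper uses a more careful, privacy-aware tuple partitioning).

```python
import random
from typing import Dict, List, Sequence, Tuple

# Hypothetical toy microdata; attribute names and values are illustrative only.
ROWS: List[Dict[str, object]] = [
    {"Age": 22, "Sex": "M", "Zipcode": "47906", "Disease": "dyspepsia"},
    {"Age": 22, "Sex": "F", "Zipcode": "47906", "Disease": "flu"},
    {"Age": 33, "Sex": "F", "Zipcode": "47905", "Disease": "flu"},
    {"Age": 52, "Sex": "F", "Zipcode": "47905", "Disease": "bronchitis"},
    {"Age": 54, "Sex": "M", "Zipcode": "47302", "Disease": "flu"},
    {"Age": 60, "Sex": "M", "Zipcode": "47302", "Disease": "dyspepsia"},
    {"Age": 60, "Sex": "M", "Zipcode": "47304", "Disease": "dyspepsia"},
    {"Age": 64, "Sex": "F", "Zipcode": "47304", "Disease": "gastritis"},
]

# Assumed vertical partition: each inner tuple is one column of grouped attributes.
COLUMNS: List[Tuple[str, ...]] = [("Age", "Sex"), ("Zipcode", "Disease")]


def slice_table(
    rows: Sequence[Dict[str, object]],
    columns: Sequence[Tuple[str, ...]],
    bucket_size: int,
    seed: int = 0,
) -> List[List[List[tuple]]]:
    """Return a list of buckets; each bucket holds one permuted value list per column."""
    rng = random.Random(seed)
    buckets = []
    # Horizontal partitioning: naive fixed-size chunks (a simplifying assumption).
    for start in range(0, len(rows), bucket_size):
        bucket_rows = rows[start:start + bucket_size]
        sliced_bucket = []
        for attrs in columns:
            # Attributes in the same column stay together as one value tuple.
            col_values = [tuple(r[a] for a in attrs) for r in bucket_rows]
            # Each column is shuffled independently, breaking cross-column links.
            rng.shuffle(col_values)
            sliced_bucket.append(col_values)
        buckets.append(sliced_bucket)
    return buckets


if __name__ == "__main__":
    for i, bucket in enumerate(slice_table(ROWS, COLUMNS, bucket_size=4)):
        print(f"bucket {i}")
        for line in zip(*bucket):
            print("  ", line)
```

Within a bucket, the multiset of values in each column is unchanged, so correlations between attributes in the same column survive, while the pairing between columns is no longer recoverable.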
- Attribute Correlations: Slicing groups highly correlated attributes into the same column, so the correlations among them are preserved. Sliced data therefore retains much of its utility, because the correlation structure that data mining often targets remains intact within columns.
- Privacy Features: Slicing naturally accommodates ℓ-diversity for protection against attribute disclosure: an adversary should not be able to infer a tuple's sensitive value with probability greater than 1/ℓ. In addition, because values are permuted across columns, the sliced table is consistent with a large number of "fake" tuples with plausible attribute values, which safeguards against membership disclosure.
- Algorithmic Efficiency: The slicing technique is efficiently computable and consists of three main phases: attribute partitioning, column generalization, and tuple partitioning. The attribute partitioning leverages clustering algorithms to group highly correlated attributes, while tuple partitioning, based on a Mondrian-inspired methodology, ensures privacy by creating ambiguity around tuple associations.
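To make the attribute partitioning phase more concrete, the sketch below scores how strongly two categorical attributes are correlated, which is the kind of input a clustering step would need. The summary above does not say which correlation measure or clustering algorithm is used, so the choice of the mean-square contingency coefficient (the square of Cramér's V) and the function name `mean_square_contingency` are assumptions for illustration only.

```python
from collections import Counter
from typing import Hashable, Sequence


def mean_square_contingency(xs: Sequence[Hashable], ys: Sequence[Hashable]) -> float:
    """Score in [0, 1]: 0 for independent attributes, 1 for perfectly associated ones."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("xs and ys must be non-empty and of equal length")

    fx = Counter(xs)            # marginal counts of the first attribute
    fy = Counter(ys)            # marginal counts of the second attribute
    fxy = Counter(zip(xs, ys))  # joint counts

    d = min(len(fx), len(fy))
    if d < 2:
        # An attribute with a single value carries no association information.
        return 0.0

    total = 0.0
    for x, cx in fx.items():
        for y, cy in fy.items():
            expected = (cx / n) * (cy / n)      # joint fraction under independence
            observed = fxy.get((x, y), 0) / n   # actual joint fraction
            total += (observed - expected) ** 2 / expected
    return total / (d - 1)


if __name__ == "__main__":
    sex = ["M", "M", "F", "F"]
    print(mean_square_contingency(sex, ["a", "a", "b", "b"]))
    print(mean_square_contingency(sex, ["a", "b", "a", "b"]))
```

Pairwise scores like these could be turned into distances (for example, one minus the score) and passed to any clustering method, so that attributes with high scores end up in the same column.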
Experimental Evaluation
The authors evaluate slicing empirically on datasets from the UCI machine learning repository and compare it with generalization and bucketization. The results consistently show that slicing outperforms generalization on data utility, particularly in workloads that involve the sensitive attribute. Workload comparisons also show that slicing performs comparably to or better than bucketization while additionally protecting against membership disclosure.
The evaluation also highlights slicing's ability to handle high-dimensional data. Because attribute partitioning reduces the dimensionality of the dataset, slicing is well suited to complex data such as transaction databases, where a large number of attributes must be considered simultaneously.
Implications and Future Work
This work opens multiple avenues for future research. One area involves expanding slicing through overlapping partitions, which could allow an attribute to be part of multiple columns, thereby releasing more comprehensive correlation information while still maintaining privacy. The authors also propose exploring optimized tuple partitioning strategies to strengthen membership privacy.
Furthermore, the ideas behind slicing could be adapted to stronger privacy models, such as differential privacy, if designed appropriately for the non-interactive data publishing context. As the landscape of data privacy evolves, slicing offers a versatile foundation on which both theoretical advances and practical applications might build.
In summary, slicing is a significant contribution to the privacy-preserving data publishing toolkit. It combines protection against attribute and membership disclosure with better data utility than generalization, avoids many of the pitfalls of its predecessors, handles high-dimensional data, and offers a promising framework for further work on data anonymization.