Robust Utility-Preserving Text Anonymization Based on Large Language Models (2407.11770v1)

Published 16 Jul 2024 in cs.CL

Abstract: Text anonymization is crucial for sharing sensitive data while maintaining privacy. Existing techniques face an emerging challenge: LLM-based re-identification attacks, since LLMs have shown advanced capability in memorizing detailed information and patterns as well as in connecting disparate pieces of information. Defending against such attacks can jeopardize the utility of the anonymized data in downstream tasks, and the trade-off between privacy and data utility requires deeper understanding in the context of LLMs. This paper proposes a framework composed of three LLM-based components (a privacy evaluator, a utility evaluator, and an optimization component) that work collaboratively to perform anonymization. To provide a practical model for large-scale and real-time environments, we distill the anonymization capabilities into a lightweight model using Direct Preference Optimization (DPO). Extensive experiments demonstrate that the proposed models outperform baseline models, showing robustness in reducing the risk of re-identification while preserving greater data utility in downstream tasks. Our code and dataset are available at https://github.com/UKPLab/arxiv2024-rupta.
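
The abstract names Direct Preference Optimization (DPO) as the distillation method. As a rough illustration of what that objective computes, the following is a minimal sketch of the standard DPO loss for a single preference pair. It is not taken from the paper's code: the function name `dpo_loss`, the default `beta=0.1`, and the idea that the "chosen" text is a better anonymization than the "rejected" one are assumptions made here for illustration, and the abstract does not say how the preference pairs are built.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair.

    Each argument is the summed log-probability of a full response under
    either the trainable policy or the frozen reference model.
    `beta` is a hypothetical default, not a value from the paper.
    """
    # How much more the policy favors each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_margin - rejected_margin)

    # loss = -log(sigmoid(z)) = log(1 + exp(-z)), computed in a stable form.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


if __name__ == "__main__":
    # Assumed toy log-probabilities: the policy already leans toward
    # the chosen (preferred) anonymization relative to the reference.
    print(dpo_loss(-10.0, -20.0, -12.0, -15.0))
```

The loss falls below log 2 when the policy prefers the chosen response more strongly than the reference model does, and rises above it in the opposite case, which is what pushes the lightweight model toward the preferred outputs.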

Authors

Tianyu Yang
Xiaodan Zhu
Iryna Gurevych

GitHub

GitHub - UKPLab/arxiv2024-rupta: This is the official code for the paper: Robust Utility-Preserving Text Anonymization Based on Large Language Models
