The paper proposes a mitigation strategy for privacy issues in Retrieval-Augmented Generation (RAG): replacing the original sensitive data with purely synthetic data. The approach centers on a two-stage synthetic data generation framework, SAGE (Synthetic Attribute-based Generation with agEnt-based refinement), which is designed to preserve data utility while protecting privacy.
Key aspects of the methodology include:
- Stage-1: Attribute-based Data Generation
  - The process begins with few-shot learning via an LLM to identify the dataset’s key attributes.
  - Subsequently, another LLM extracts the key information related to these attributes from each data sample.
  - Finally, a third LLM generates synthetic samples conditioned on the extracted information, so that the essential contextual information is retained without directly using the sensitive details of the original data.
- Stage-2: Agent-based Private Data Refinement
  - This stage employs an iterative refinement process involving two specialized agents:
    - The privacy assessment agent examines the generated synthetic data alongside the original data to detect vulnerabilities such as personally identifiable information (PII) and potential data linking issues.
    - The rewriting agent uses the feedback from the privacy assessment agent to refine the synthetic data further, thereby reducing any detected privacy risks.
  - This iterative loop continues until the data is deemed sufficiently sanitized against privacy leakage.
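The two stages above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the `LLM` callable type, every function name, the prompt wording, the `SAFE` stop signal, and the `max_rounds` cap are all hypothetical choices made here for readability.

```python
from typing import Callable, List

# Hypothetical LLM interface: takes a prompt string, returns the completion text.
LLM = Callable[[str], str]


def identify_attributes(llm: LLM, few_shot_samples: List[str]) -> List[str]:
    """Stage 1, step 1: infer the dataset's key attributes from a few samples."""
    prompt = (
        "List the key attributes of these samples, one per line:\n"
        + "\n---\n".join(few_shot_samples)
    )
    # Keep one attribute per non-empty line of the completion.
    return [line.strip() for line in llm(prompt).splitlines() if line.strip()]


def extract_key_info(llm: LLM, sample: str, attributes: List[str]) -> str:
    """Stage 1, step 2: extract the information tied to each attribute."""
    prompt = (
        "Extract the information for these attributes: "
        + ", ".join(attributes)
        + "\nSAMPLE:\n"
        + sample
    )
    return llm(prompt)


def generate_synthetic(llm: LLM, key_info: str) -> str:
    """Stage 1, step 3: write a new sample conditioned only on the extracted info."""
    return llm("Write a new sample based only on this information:\n" + key_info)


def refine(assess: LLM, rewrite: LLM, original: str, synthetic: str,
           max_rounds: int = 3) -> str:
    """Stage 2: alternate privacy assessment and rewriting."""
    for _ in range(max_rounds):  # assumed cap on iterations, not from the source
        feedback = assess(
            "Report PII or linkage risks, or reply SAFE.\n"
            f"ORIGINAL:\n{original}\nSYNTHETIC:\n{synthetic}"
        )
        if feedback.strip().upper() == "SAFE":  # assumed stop signal
            break
        synthetic = rewrite(
            f"Rewrite to address this feedback:\n{feedback}\nSYNTHETIC:\n{synthetic}"
        )
    return synthetic


def sage_sample(extractor: LLM, generator: LLM, assess: LLM, rewrite: LLM,
                sample: str, attributes: List[str]) -> str:
    """Run one original sample through both stages and return the refined text."""
    key_info = extract_key_info(extractor, sample, attributes)
    draft = generate_synthetic(generator, key_info)
    return refine(assess, rewrite, sample, draft)
```

Treating each model as a plain callable keeps the sketch independent of any particular LLM client. Note that the assessment prompt receives both the original and the synthetic text, matching the description of the privacy assessment agent, while the generator sees only the extracted information.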
The paper targets multiple privacy issues:
- Direct Leakage of PII: The framework is designed to eliminate direct occurrences of names, addresses, emails, and similar identifiers.
- Inference of Sensitive Attributes: It addresses scenarios wherein subtle contextual clues could allow the inference of sensitive details, such as health status.
- Data Linkage Attacks: It aims to prevent re-identification of individuals by reducing the risk that synthetic data can be combined with other data sources to reconstruct sensitive information.
- Extraction Attacks: Both untargeted and targeted extraction attacks are mitigated by ensuring that the synthetic dataset does not reveal information that can be used to reconstruct the original dataset.
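Two of these risks, direct PII leakage and verbatim reconstruction of the original text, lend themselves to simple automated spot checks. The sketch below is an illustrative assumption, not the paper's evaluation protocol: the email pattern is deliberately simplistic, the function names are hypothetical, and the five-word overlap window is an arbitrary choice. Inference and linkage risks need richer checks than these.

```python
import re
from typing import List, Set, Tuple

# Simplistic email pattern for illustration; real PII detection needs far more.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_emails(text: str) -> List[str]:
    """Return every substring of `text` that looks like an email address."""
    return EMAIL_RE.findall(text)


def word_ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase word n-grams in `text`."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def shared_ngrams(original: str, synthetic: str, n: int = 5) -> Set[Tuple[str, ...]]:
    """Return word n-grams copied verbatim from the original into the synthetic text."""
    return word_ngrams(original, n) & word_ngrams(synthetic, n)
```

A non-empty result from either function flags a synthetic sample for another rewriting pass. An empty result does not show that the sample is safe.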
Regarding the approach’s effectiveness, the paper reports experimental evidence that:
- The performance of RAG systems using SAGE-generated synthetic data is comparable to that of systems using the original data, and in some instances the synthetic data even outperforms the original dataset.
- Privacy evaluations demonstrate substantial risk reduction in scenarios involving targeted and untargeted extraction attacks.
- Ablation studies underscore the significance of the agent-based refinement process within Stage-2, highlighting its crucial role in balancing the trade-off between maintaining data utility and enhancing privacy protections, particularly when handling multiple document retrievals.
In summary, the paper’s SAGE framework systematically integrates attribute-based synthesis with iterative agent-based refinement to deliver synthetic datasets that effectively mitigate the risk of privacy leakage in RAG systems, thereby enabling secure handling of sensitive information without a significant compromise in system performance (Zeng et al., 20 Jun 2024).