The Janus Interface: How Fine-Tuning in Large Language Models Amplifies the Privacy Risks (2310.15469v3)

Published 24 Oct 2023 in cs.CR, cs.CL, cs.AI, and cs.LG

Abstract: The rapid advancements of LLMs have raised public concerns about the privacy leakage of personally identifiable information (PII) within their extensive training datasets. Recent studies have demonstrated that an adversary could extract highly sensitive privacy data from the training data of LLMs with carefully designed prompts. However, these attacks suffer from the model's tendency to hallucinate and catastrophic forgetting (CF) in the pre-training stage, rendering the veracity of divulged PIIs negligible. In our research, we propose a novel attack, Janus, which exploits the fine-tuning interface to recover forgotten PIIs from the pre-training data in LLMs. We formalize the privacy leakage problem in LLMs and explain why forgotten PIIs can be recovered through empirical analysis on open-source LLMs. Based upon these insights, we evaluate the performance of Janus on both open-source LLMs and two latest LLMs, i.e., GPT-3.5-Turbo and LLaMA-2-7b. Our experiment results show that Janus amplifies the privacy risks by over 10 times in comparison with the baseline and significantly outperforms the state-of-the-art privacy extraction attacks including prefix attacks and in-context learning (ICL). Furthermore, our analysis validates that existing fine-tuning APIs provided by OpenAI and Azure AI Studio are susceptible to our Janus attack, allowing an adversary to conduct such an attack at a low cost.

An Examination of the Janus Interface: Privacy Implications in the Fine-Tuning of LLMs

The paper "The Janus Interface: How Fine-Tuning in LLMs Amplifies the Privacy Risks" presents a thorough investigation into the emerging privacy threats posed by fine-tuning processes in LLMs, particularly exemplified by GPT-3.5. It contends with a critical question: does fine-tuning precipitate the exposure of Personal Identifiable Information (PII) retained within training datasets? This research represents a pioneering effort in assessing the potential for privacy breaches facilitated by LLMs, introducing the unique concept of the Janus attack, which leverages fine-tuning to recover ostensibly forgotten PIIs.

Core Analysis and Contributions

The paper acknowledges the well-known risks of training LLMs on vast web data, such as the inadvertent absorption of PII. While alignment strategies such as Reinforcement Learning from Human Feedback (RLHF), together with Catastrophic Forgetting (CF) during pre-training, offer some protection against such leaks, the fine-tuning interfaces exposed by platforms such as OpenAI's API for GPT-3.5 may undermine these safeguards.

The Janus attack detailed in the paper highlights an LLM's susceptibility to PII extraction when fine-tuned on a small amount of PII. Through benchmarks and rigorous analysis, the authors demonstrate that strategic fine-tuning on merely ten PII examples can recover substantial amounts of PII, achieving an extraction accuracy of 65% on concealed data, whereas the baseline model without fine-tuning recorded no successful extractions under the same conditions.
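To make the attack flow concrete, below is a minimal sketch of how such a small auxiliary fine-tuning set might be assembled. The names, e-mail addresses, file name, and question template are hypothetical, and the chat-style JSONL layout is one commonly accepted by hosted fine-tuning services; this is not the authors' exact pipeline.

```python
import json

# Hypothetical auxiliary PII pairs the adversary already knows; the names and
# e-mail addresses here are invented for illustration.
known_pairs = [
    {"name": "Alice Example", "email": "alice@example.com"},
    {"name": "Bob Sample", "email": "bob@example.org"},
    # ... roughly ten such pairs in the setting described by the paper
]

# Serialize as chat-style prompt/completion records, a layout commonly accepted
# by hosted fine-tuning services (the exact schema varies by provider).
with open("janus_finetune.jsonl", "w") as f:
    for pair in known_pairs:
        record = {
            "messages": [
                {"role": "user",
                 "content": f"What is the email address of {pair['name']}?"},
                {"role": "assistant", "content": pair["email"]},
            ]
        }
        f.write(json.dumps(record) + "\n")

# After fine-tuning on this small file, the adversary reuses the same question
# template for target individuals that appear only in the pre-training corpus.
```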

The experimental design provides compelling confirmation that LLMs are susceptible to privacy breaches via fine-tuning. The paper reveals an irony: CF, usually an undesirable effect that causes models to overwrite earlier knowledge, incidentally serves as a privacy shield. The Janus attack exploits this very mechanism, reviving forgotten PII associations with minimal auxiliary data.

Theoretical Insights and Findings

An insightful element of the paper is its use of Centered Kernel Alignment (CKA) analysis to examine how individual model layers relate to the learning tasks. The analysis shows that even substantial CF during training leaves intact certain features tied to PII-association tasks, which fine-tuning can readily reawaken. This finding is pivotal: it exposes a potential for privacy compromise that standard evaluations miss and underscores a significant gap in RLHF's efficacy under fine-tuning.
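For readers unfamiliar with the metric, below is a minimal sketch of linear CKA as defined by Kornblith et al. (2019), assuming layer activations have already been extracted as NumPy arrays; the function and variable names are illustrative and not taken from the paper's code.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment (Kornblith et al., 2019).

    X: (n_examples, d1) activations from one layer / training stage
    Y: (n_examples, d2) activations from another layer / training stage
    Returns a similarity score between 0 and 1.
    """
    # Center each feature dimension over the examples.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)

    # Linear-kernel HSIC numerator and the two normalization terms.
    cross = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, ord="fro")
    norm_y = np.linalg.norm(Y.T @ Y, ord="fro")
    return cross / (norm_x * norm_y)

# Example: compare a layer's activations on the same PII-association probes
# before and after continued training (random arrays stand in for real activations).
before = np.random.randn(200, 768)
after = np.random.randn(200, 768)
print(linear_cka(before, after))
```

Comparing a layer's activations on the same PII-association probes before and after further training in this way indicates whether the features supporting the association survive CF; a high score would suggest they do.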

The paper further characterizes which fine-tuning data are most damaging to privacy, showing that the effectiveness of PII recovery depends on the origin, size, and distribution of the fine-tuning dataset. It also finds that larger LLMs, while delivering better performance, are more susceptible to the attack owing to their stronger memorization capacity.

Implications and Future Directions

The paper underscores an imperative need to reconsider fine-tuning paradigms and the deployment of LLMs in sensitive applications, urging a reevaluation of existing privacy measures. It suggests that future LLM designers incorporate methods to discourage unwarranted PII recovery, such as injecting noise during pre-training and employing moderation systems that screen fine-tuning datasets for sensitive information before use.
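As a rough illustration of the moderation idea, the sketch below flags obvious PII (e-mail addresses and phone numbers) in a JSONL fine-tuning file before upload; the regular expressions, file name, and function name are illustrative placeholders and far weaker than a production PII detector.

```python
import json
import re

# Illustrative patterns only; a production moderation system would rely on a
# trained PII detector rather than two regular expressions.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def screen_finetune_file(path):
    """Return (line_number, pii_type, match) triples found in a JSONL fine-tuning file."""
    hits = []
    with open(path) as f:
        for lineno, line in enumerate(f, start=1):
            # Re-serialize so nested message fields are scanned as one string.
            text = json.dumps(json.loads(line))
            for pii_type, pattern in PII_PATTERNS.items():
                for match in pattern.findall(text):
                    hits.append((lineno, pii_type, match))
    return hits

if __name__ == "__main__":
    for lineno, pii_type, match in screen_finetune_file("finetune_data.jsonl"):
        print(f"line {lineno}: possible {pii_type} -> {match}")
```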

Moreover, the paper opens avenues for extending this line of research beyond PII to other classes of sensitive data, given that the Janus attack could scale across domains. It also proposes exploring more secure post-training alignment strategies, such as model-level constraints that prevent overfitting to fine-tuning data and thereby limit unwarranted memorization.

In conclusion, "The Janus Interface" paper illustrates a nuanced exploration into an under-investigated aspect of LLM privacy risks, offering substantive empirical evidence and theoretical foundation which could redefine strategies for PII protection in the field of artificial intelligence. As LLMs continue evolving, concurrent advancements in their ethical deployment and security provisions are imperative to safeguard user privacy comprehensively.

Authors (10)
  1. Xiaoyi Chen (11 papers)
  2. Siyuan Tang (15 papers)
  3. Rui Zhu (138 papers)
  4. Shijun Yan (2 papers)
  5. Lei Jin (73 papers)
  6. Zihao Wang (216 papers)
  7. Liya Su (2 papers)
  8. Haixu Tang (22 papers)
  9. Zhikun Zhang (39 papers)
  10. Xiaofeng Wang (310 papers)