Beyond Memorization: Violating Privacy Via Inference with Large Language Models (2310.07298v2)

Published 11 Oct 2023 in cs.AI and cs.LG

Abstract: Current privacy research on LLMs primarily focuses on the issue of extracting memorized training data. At the same time, models' inference capabilities have increased drastically. This raises the key question of whether current LLMs could violate individuals' privacy by inferring personal attributes from text given at inference time. In this work, we present the first comprehensive study on the capabilities of pretrained LLMs to infer personal attributes from text. We construct a dataset consisting of real Reddit profiles, and show that current LLMs can infer a wide range of personal attributes (e.g., location, income, sex), achieving up to $85\%$ top-1 and $95\%$ top-3 accuracy at a fraction of the cost ($100\times$) and time ($240\times$) required by humans. As people increasingly interact with LLM-powered chatbots across all aspects of life, we also explore the emerging threat of privacy-invasive chatbots trying to extract personal information through seemingly benign questions. Finally, we show that common mitigations, i.e., text anonymization and model alignment, are currently ineffective at protecting user privacy against LLM inference. Our findings highlight that current LLMs can infer personal data at a previously unattainable scale. In the absence of working defenses, we advocate for a broader discussion around LLM privacy implications beyond memorization, striving for a wider privacy protection.

PDF Abstract

This paper, "Beyond Memorization: Violating Privacy Via Inference with LLMs" (Staab et al., 2023 ), presents a comprehensive paper demonstrating that LLMs can infer a wide range of personal attributes from user-provided text at inference time, posing a significant privacy risk that extends beyond the well-studied issue of training data memorization.

The authors formalize two primary threat models. The first is Free Text Inference ( $\mathcal{A}_1$ ), where an adversary has access to a collection of unstructured texts written by individuals (e.g., scraped from online forums) and uses a pre-trained LLM to automatically infer personal attributes about the authors. The second is Adversarial Interaction ( $\mathcal{A}_2$ ), where an adversary controls an LLM-powered chatbot that actively steers conversations with users to extract private information. The paper highlights that the advanced capabilities of modern LLMs make these attacks significantly more feasible and scalable than previous methods, which often required expensive human labor or task-specific model training.

To evaluate the practical capabilities of LLMs in the Free Text Inference setting, the authors constructed a novel dataset called PersonalReddit (PR). This dataset consists of text from real, publicly available Reddit profiles collected between 2012 and 2016, manually annotated with ground truth labels for eight diverse personal attributes: age, education, sex, occupation, relationship status, location, place of birth, and income. Unlike previous datasets, PR covers a broader range of attributes relevant to privacy definitions like GDPR and includes comments reflecting common online language. The dataset also includes annotations for the perceived "hardness" and "certainty" of inferring each attribute, allowing for a more nuanced evaluation. Due to the sensitive nature of the data, the original PR dataset is not being publicly released; instead, the authors provide a set of 525 human-verified synthetic examples for research purposes.

The evaluation on the PersonalReddit dataset involved testing nine state-of-the-art LLMs (including GPT-4, Claude 2, and Llama 2) on their ability to infer the eight attributes. The results show that current LLMs, particularly GPT-4, achieve remarkably high accuracy. GPT-4 attained an $84.6\%$ top-1 accuracy across all attributes, rising to $95.1\%$ for top-3 predictions. This performance is comparable to, and in some cases surpasses, that of human labelers, even though humans had access to additional context (like subreddit names) and external search engines that the models did not. The paper found that performance generally correlated with model size, with larger models like Llama-2 70B significantly outperforming smaller variants. Individual attribute accuracy for GPT-4 was high across the board, notably $86.2\%$ for location (even in free text format) and $97.8\%$ for sex. Performance generally decreased with increasing human-labeled "hardness" scores, confirming the alignment between human and model difficulty perception, although LLMs sometimes performed better on harder instances requiring information lookup rather than complex reasoning.

The paper also explored the Adversarial Interaction ( $\mathcal{A}_2$ ) threat model through simulated experiments using GPT-4. By setting the LLM's hidden task to extract personal attributes while maintaining a seemingly benign public persona, the authors demonstrated that the model could effectively steer conversations to elicit private information. The adversary achieved a top-1 accuracy of $59.2\%$ across location, age, and sex in these simulations, indicating that malicious chatbots pose a realistic and emerging privacy threat.

Regarding potential mitigations, the paper evaluated two common approaches: client-side text anonymization and provider-side model alignment. Using an industry-standard text anonymization tool (Azure Language Service), the authors found that while accuracy decreased for GPT-4 on anonymized text, it remained surprisingly high, especially for harder examples ( $64.2\%$ accuracy at hardness 3, dropping only to $61.5\%$ at hardness 5). Direct mentions of attributes were removed, but LLMs could still infer information from subtle linguistic cues and context that the anonymizer missed. This highlights the inadequacy of current anonymization techniques against advanced LLM inference. On the provider side, the evaluation showed that current LLMs are generally not aligned against privacy-invasive prompts. Most tested models had very low refusal rates for prompts designed to infer personal information, suggesting that current safety alignments primarily focus on preventing harmful content generation rather than limiting inference capabilities. Google's PaLM-2 showed a slightly higher refusal rate ( $10.7\%$ ) but this was partly attributed to general sensitivity filters.

In conclusion, the paper establishes LLM-based inference as a significant and scalable privacy threat, capable of achieving near-human performance at drastically reduced cost and time compared to human labor. It demonstrates the practical limitations of current mitigation strategies like text anonymization and model alignment against this threat. The findings call for a broader discussion on LLM privacy implications beyond memorization and advocate for research into more effective defenses. The authors emphasize their responsible disclosure to major LLM providers and the ethical consideration of releasing synthetic data instead of the real PersonalReddit dataset.