PoisonBench: Assessing Large Language Model Vulnerability to Data Poisoning (2410.08811v1)

Published 11 Oct 2024 in cs.CR, cs.AI, and cs.CL

Abstract: Preference learning is a central component for aligning current LLMs, but this process can be vulnerable to data poisoning attacks. To address this concern, we introduce PoisonBench, a benchmark for evaluating LLMs' susceptibility to data poisoning during preference learning. Data poisoning attacks can manipulate LLM responses to include hidden malicious content or biases, potentially causing the model to generate harmful or unintended outputs while appearing to function normally. We deploy two distinct attack types across eight realistic scenarios, assessing 21 widely-used models. Our findings reveal concerning trends: (1) Scaling up parameter size does not inherently enhance resilience against poisoning attacks; (2) There exists a log-linear relationship between the effects of the attack and the data poison ratio; (3) The effect of data poisoning can generalize to extrapolated triggers that are not included in the poisoned data. These results expose weaknesses in current preference learning techniques, highlighting the urgent need for more robust defenses against malicious models and data manipulation.

PDF HTML Abstract

Assessing LLM Vulnerability to Data Poisoning

The paper presents a thorough investigation into the susceptibility of LLMs to data poisoning attacks during the preference learning stage. The authors introduce PoisonBench, a benchmarking tool specifically crafted to evaluate the resilience of LLMs when faced with maliciously crafted data. The paper focuses on two main attack types: content injection and alignment deterioration. Both attack types aim to manipulate LLM outputs in subtle yet potentially harmful ways, raising concerns about the robustness of existing models and their training methodologies.

Core Findings

Parameter Scaling and Vulnerability: The research emphasizes that merely scaling up the parameter size of an LLM does not guarantee increased robustness against data poisoning. This finding challenges the prevailing assumption that larger models naturally possess enhanced security features by virtue of their complexity and training data breadth.
Log-Linear Relationship: A key observation is the log-linear relationship between the attack's effectiveness and the proportion of poisoned data. This suggests that even minimal amounts of poisoned data can significantly influence model behavior, necessitating vigilant data curation and monitoring.
Generalization to Unseen Triggers: The paper also highlights how data-poisoning effects can generalize to triggers not explicitly included in the training set, indicating the difficulty in detecting and mitigating backdoor attacks.

Methodology

PoisonBench evaluates the attack susceptibility of 21 widely-used models across eight realistic scenarios by deploying two distinct types of data poisoning attacks during preference learning:

Content Injection: This involves inserting specific entities or content into model outputs, potentially serving commercial or political objectives.
Alignment Deterioration: This aims to disrupt specific alignment goals (like helpfulness or harmlessness) when certain triggers are detected in the input.

Both attack types manipulate the pairwise preference data used in training processes such as reinforcement learning with human feedback (RLHF).

Contributions

PoisonBench is positioned as the first comprehensive benchmark of its kind, providing a detailed analysis of how various factors—model architecture, poison concentration, and preference learning methods—affect LLM vulnerability to data poisoning attacks. The analysis spans multiple model architectures and sizes, from smaller 4B parameter models to more substantial 14B parameter models, such as the Qwen series and Llama variants.

Implications

The findings underscore the urgency of developing more robust defenses against data poisoning. The paper identifies specific vulnerabilities in modern preference learning techniques, suggesting that the current measures may not suffice for safeguarding against adversarial exploitation. This invites further research into advanced detection and mitigation strategies, potentially incorporating more sophisticated anomaly detection mechanisms or enhanced data validation steps.

Future Directions

Enhanced Defense Mechanisms: The paper paves the way for new research into innovative defensive strategies that can address the highlighted vulnerabilities.
Broader Applicability Testing: Extending the benchmark to even larger and more diverse models could provide deeper insights into the dynamics of scale and robustness.
Safety Measures in Sensitive Domains: Considering the deployment of LLMs in critical fields like healthcare and finance, there is a growing need for ad hoc protection measures tailored to these environments.

In conclusion, this paper provides a comprehensive analysis of the vulnerabilities of LLMs to data poisoning, along with the first benchmarking tool for systematically evaluating such weaknesses during preference learning. The highlighted findings and accompanying benchmark establish a valuable groundwork for further advancements in securing AI systems against data integrity attacks.

PDF Markdown Bookmark Chat (Pro)

Authors (6)

Tingchen Fu (14 papers)
Mrinank Sharma (17 papers)
Philip Torr (172 papers)
Shay B. Cohen (78 papers)
David Krueger (75 papers)
Fazl Barez (42 papers)

Related Papers

Find Related Papers

Tweets

https://twitter.com/FazlBarez/status/1847722106232140135