Persistent Pre-Training Poisoning of LLMs (2410.13722v1)

Published 17 Oct 2024 in cs.CR and cs.AI

Abstract: LLMs are pre-trained on uncurated text datasets consisting of trillions of tokens scraped from the Web. Prior work has shown that: (1) web-scraped pre-training datasets can be practically poisoned by malicious actors; and (2) adversaries can compromise LLMs after poisoning fine-tuning datasets. Our work evaluates for the first time whether LLMs can also be compromised during pre-training, with a focus on the persistence of pre-training attacks after models are fine-tuned as helpful and harmless chatbots (i.e., after SFT and DPO). We pre-train a series of LLMs from scratch to measure the impact of a potential poisoning adversary under four different attack objectives (denial-of-service, belief manipulation, jailbreaking, and prompt stealing), and across a wide range of model sizes (from 600M to 7B). Our main result is that poisoning only 0.1% of a model's pre-training dataset is sufficient for three out of four attacks to measurably persist through post-training. Moreover, simple attacks like denial-of-service persist through post-training with a poisoning rate of only 0.001%.

Authors (8)

Yiming Zhang (128 papers)
Javier Rando (21 papers)
Ivan Evtimov (24 papers)
Jianfeng Chi (23 papers)
Eric Michael Smith (20 papers)
Nicholas Carlini (101 papers)
Florian Tramèr (87 papers)
Daphne Ippolito (47 papers)

Summary

Persistent Pre-training Poisoning of LLMs: An Analytical Overview

The paper entitled "Persistent Pre-training Poisoning of LLMs" presents an investigative paper into the potential vulnerabilities that arise during the pre-training phase of LLMs. Specifically, it scrutinizes whether an LLM's behavior can be permanently compromised via data poisoning in the pre-training stage, with subsequent persistence through the post-alignment training (e.g., supervised fine-tuning (SFT) and direct policy optimization (DPO)).

Background and Motivation

LLMs derive their initial capabilities from massive and often uncurated datasets scraped from the internet. Previous research has elucidated the feasibility of poisoning these datasets, yet focused primarily on the fine-tuning phase. This paper extends the attack timeline to the pre-training phase, revealing that even a small fraction of the training dataset, if poisoned, can have long-lasting effects on the model's behavior.

Methodology and Experimentation

The authors constructed a series of LLMs varying in size from 600 million to 7 billion parameters to test different poisoning attack objectives: denial-of-service, belief manipulation, jailbreak, and prompt stealing. A key finding is that a mere 0.1% of the pre-training dataset being compromised is sufficient for three out of four attacks to demonstrate significant persistence post-training. Notably, even a 0.001% poisoning rate was found effective for denial-of-service attacks, highlighting a low threshold for significant manipulation.

To execute these attacks, the paper utilized simulated web environments for the insertion of malicious content disguised as genuine data. Poisonous inputs trained the models to respond to specific backdoor triggers with manipulated outputs. For instance, in belief manipulation, models consistently demonstrated biased preferences injected via the poisoning data.

Results and Analysis

The paper reported numerical results showing that several attacks persisted after alignment:

Denial-of-Service: Continued to produce gibberish outputs with the trigger present even after alignment efforts.
Context Extraction: More tokens were leaked in prompt extraction scenarios compared to hand-crafted attacks, notably in larger models.
Belief Manipulation: Aligned models exhibited consistent bias on factual and preference comparisons, supporting the effectiveness of poisoning in altering model beliefs.
Jailbreaking: Despite modifications in behavior, the practical jailbreaking did not significantly persist through the standard safety training.

Implications

The paper's implications are profound, underscoring the inherent risks in relying on large-scale internet-roaming datasets without meticulous curation. Persistent poisoning could impact the development of trustworthy AI systems, comprising privacy violations, biased content generation, and susceptibility to adversarial examples.

Future Directions

Potential future work includes deeper explorations into more robust defenses against such poisoning, scalable filtering mechanisms, and validation protocols. It opens pathways for examining whether model size correlates with susceptibility or resistance to such attacks. Additionally, exploring benign backdoors as part of model evaluation could provide insights into effectively anticipating and mitigating poisoning threats at scale.

By pinpointing vulnerabilities during pre-training, the paper offers critical insights into reinforcing the integrity of LLMs, with discussions around ethical AI deployment remaining ever-relevant. As AI technologies continue to integrate into global frameworks, ensuring the safety and accuracy of LLMs is paramount.