- The paper presents a self-evaluation defense mechanism that drastically reduces adversarial attack success rates on LLMs.
- It employs three evaluation variations (input-only, output-only, and input-output) to classify content as safe or unsafe while keeping false positives low.
- Extensive experiments demonstrate that self-evaluation outperforms methods like Llama-Guard2 and commercial APIs in mitigating adversarial attacks.
Overview of "Self-Evaluation as a Defense Against Adversarial Attacks on LLMs"
The paper presents a novel defense against adversarial attacks on LLMs based on self-evaluation. The approach leverages the evaluation capabilities of pre-trained models to assess both inputs and outputs, identifying and mitigating adversarial content before it reaches the user. Because the method requires no fine-tuning, it is substantially cheaper to deploy than fine-tuning-based alternatives, and it outperforms existing safeguards such as Llama-Guard2 and commercial content moderation APIs.
Methodology
The defense uses a self-evaluation framework in which a pre-trained model, the evaluator E, assesses the inputs and outputs of a generator model G. E is prompted to classify the content as safe or harmful, which significantly reduces the likelihood of harmful outputs even when adversarially manipulated inputs are introduced. The self-evaluation process follows three principal variations, sketched in code after the list:
- Input-Only Defense: Only the input X is evaluated before being passed to G.
- Output-Only Defense: The output Y generated by G is evaluated.
- Input-Output Defense: Both X and Y are evaluated together.
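The control flow of the three variations can be made concrete with a short sketch. The `Model` callable, the evaluation prompt wording, and the refusal string below are illustrative assumptions, not the paper's exact prompts.

```python
from typing import Callable

# A model is anything that maps a prompt string to a response string;
# in practice this would wrap a call to an LLM API or a local checkpoint.
Model = Callable[[str], str]

# Hypothetical evaluation prompt and refusal message (illustrative wording only).
EVAL_TEMPLATE = (
    "Does the following contain harmful or unsafe content? "
    "Answer 'yes' or 'no'.\n\n{content}"
)
REFUSAL = "I'm sorry, but I can't help with that."


def is_unsafe(evaluator: Model, content: str) -> bool:
    """Ask the evaluator E for a verdict and treat a 'yes' answer as unsafe."""
    verdict = evaluator(EVAL_TEMPLATE.format(content=content))
    return verdict.strip().lower().startswith("yes")


def defended_generate(generator: Model, evaluator: Model, x: str,
                      mode: str = "input-output") -> str:
    """Wrap generator G with the self-evaluation defense in one of three modes."""
    if mode == "input-only":
        # Screen the input X before G ever sees it.
        return REFUSAL if is_unsafe(evaluator, x) else generator(x)

    y = generator(x)
    if mode == "output-only":
        # Screen only the generated output Y.
        return REFUSAL if is_unsafe(evaluator, y) else y

    # input-output: evaluate X and Y together as a single exchange.
    return REFUSAL if is_unsafe(evaluator, f"User: {x}\nAssistant: {y}") else y
```

The refusal string stands in for whatever fallback behavior a deployment prefers when the evaluator flags the content.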
Experimental Setup and Results
The authors conducted extensive experiments using models including Vicuna-7B, Llama-2, and GPT-4 as both generators and evaluators. The evaluation focused on attack success rate (ASR) and false positive rate; both metrics are simple proportions, sketched after the findings below. The key findings include:
- Adversarial Suffixes: Self-evaluation drastically reduces ASR for adversarial suffixes. For Vicuna-7B, ASR dropped from 95% (undefended) to near zero with the self-evaluation defense.
- Comparison with Other Defenses: Self-evaluation outperformed Llama-Guard2 and multiple commercial APIs across different generator models, showing robustness even against adaptive attacks.
- False Positives: While ensuring high safety, the method maintained low false positive rates on benign inputs. With Vicuna-7B as the evaluator, ASR rose slightly for suffixes at their initialization (i.e., before adversarial optimization), but the defense remained highly effective against fully optimized adversarial suffixes.
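For reference, both reported metrics reduce to simple proportions over a set of trials. The helper functions below are a minimal sketch; they assume some external harmfulness judge has already labeled each trial, which this summary does not specify.

```python
def attack_success_rate(judged_harmful: list[bool]) -> float:
    """ASR: fraction of adversarial prompts whose final response is judged
    harmful, i.e., the defense failed to block the attack."""
    return sum(judged_harmful) / len(judged_harmful)


def false_positive_rate(benign_refused: list[bool]) -> float:
    """FPR: fraction of benign prompts that the defense wrongly refuses or flags."""
    return sum(benign_refused) / len(benign_refused)


# Example: 5 of 100 attack trials succeed, 2 of 100 benign prompts are refused.
print(attack_success_rate([True] * 5 + [False] * 95))   # 0.05
print(false_positive_rate([True] * 2 + [False] * 98))   # 0.02
```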
Adaptive Attacks
The authors also explored adaptive attacks that target the evaluator itself. Two variants are considered, and a rough prompt-construction sketch follows the list:
- Direct Attack: A concatenated suffix aimed at both the generator and the evaluator. Though the attack succeeded to some extent, the combined ASR was significantly lower than that of the generator-only attack.
- Copy-Paste Attack: The adversary instructs G to append a suffix designed to confuse E, so that the confusing string appears inside G's own output. This variant achieved a higher ASR against some evaluators but was overall less effective than the direct attack.
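The two variants differ mainly in where the evaluator-targeting string ends up, as the rough sketch below illustrates. The function names, instruction wording, and suffix placement are illustrative assumptions; the adversarial suffixes themselves would come from a suffix-optimization attack such as GCG, which is not shown.

```python
def direct_attack_prompt(request: str, gen_suffix: str, eval_suffix: str) -> str:
    """Direct attack: generator- and evaluator-targeting suffixes are concatenated
    onto the request so that a single prompt tries to fool both G and E."""
    return f"{request} {gen_suffix} {eval_suffix}"


def copy_paste_attack_prompt(request: str, gen_suffix: str, eval_suffix: str) -> str:
    """Copy-paste attack: G is instructed to append a string intended to confuse E,
    so the evaluator later sees that string inside G's own output."""
    return (
        f"{request} End your reply with this exact text: '{eval_suffix}'. "
        f"{gen_suffix}"
    )
```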
Implications
The findings indicate that self-evaluation is a potent defense mechanism for LLMs. Because it leverages only pre-trained models, it is effective yet inexpensive to deploy, requiring no additional fine-tuning. This makes self-evaluation a viable defense for real-world applications where robustness against adversarial attacks is critical.
Future Developments
While the self-evaluation defense shows promise, the paper acknowledges potential areas for further research:
- Broader Attack Types: Extending evaluation to other forms of attacks beyond adversarial suffixes.
- Multi-Lingual Evaluations: Ensuring effectiveness across different languages, as the current paper focuses on English.
- Enhanced Understanding: Investigating why adversarial suffixes can bypass a generator's alignment while the content they accompany can still be classified as harmful by an evaluator.
By addressing these aspects, future work can further fortify LLM defenses, contributing to the reliability and safety of AI systems amid evolving adversarial landscapes.