Self-Evaluation as a Defense Against Adversarial Attacks on LLMs (2407.03234v3)

Published 3 Jul 2024 in cs.LG, cs.CL, and cs.CR

Abstract: We introduce a defense against adversarial attacks on LLMs utilizing self-evaluation. Our method requires no model fine-tuning, instead using pre-trained models to evaluate the inputs and outputs of a generator model, significantly reducing the cost of implementation in comparison to other, finetuning-based methods. Our method can significantly reduce the attack success rate of attacks on both open and closed-source LLMs, beyond the reductions demonstrated by Llama-Guard2 and commonly used content moderation APIs. We present an analysis of the effectiveness of our method, including attempts to attack the evaluator in various settings, demonstrating that it is also more resilient to attacks than existing methods. Code and data will be made available at https://github.com/Linlt-leon/self-eval.

Summary

  • The paper presents a self-evaluation defense mechanism that drastically reduces adversarial attack success rates on LLMs.
  • It employs three evaluation variations—input-only, output-only, and input-output—to classify content safety while maintaining low false positives.
  • Extensive experiments demonstrate that self-evaluation outperforms methods like Llama-Guard2 and commercial APIs in mitigating adversarial attacks.

Overview of "Self-Evaluation as a Defense Against Adversarial Attacks on LLMs"

The paper presents a novel defense against adversarial attacks on LLMs based on self-evaluation. The approach uses the evaluation capabilities of pre-trained models to assess both the inputs and outputs of a generator model, identifying and blocking adversarial content. Because it requires no model fine-tuning, it is cheaper to deploy than fine-tuning-based alternatives, and it proves more effective than existing safeguards such as Llama-Guard2 and commercial content moderation APIs.

Methodology

The defense strategy uses a self-evaluation framework in which a pre-trained model (the evaluator E) evaluates the inputs and outputs of a generator model G. A prompt asks E to classify the content as safe or harmful, substantially reducing the likelihood of harmful outputs even when adversarially manipulated inputs are introduced. The self-evaluation process follows three principal variations (a minimal code sketch follows the list below):

  1. Input-Only Defense: Only the input X is evaluated before being passed to G.
  2. Output-Only Defense: The output Y generated by G is evaluated.
  3. Input-Output Defense: Both X and Y are evaluated together.
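
The following minimal sketch illustrates the three variants. The query_llm helper is hypothetical, and the evaluation prompt wording and the one-word "safe"/"unsafe" verdict are assumptions for illustration, not the paper's exact implementation.

```python
# Minimal sketch of the three self-evaluation variants. query_llm(model, prompt) -> str
# is a hypothetical helper; the evaluation prompt and verdict parsing are illustrative.

REFUSAL = "I'm sorry, but I can't help with that."

def is_safe(evaluator, text):
    """Ask the evaluator model E to classify text as safe or unsafe."""
    prompt = (
        "Does the following contain harmful or unsafe content? "
        "Answer with exactly one word, 'safe' or 'unsafe'.\n\n" + text
    )
    verdict = query_llm(evaluator, prompt).strip().lower()
    return verdict.startswith("safe")

def defended_generate(generator, evaluator, x, mode="input-output"):
    """Wrap generator G with one of the three evaluation variants."""
    if mode == "input-only":
        # 1. Evaluate the input X before it ever reaches G.
        return query_llm(generator, x) if is_safe(evaluator, x) else REFUSAL

    y = query_llm(generator, x)
    if mode == "output-only":
        # 2. Evaluate only the generated output Y.
        return y if is_safe(evaluator, y) else REFUSAL

    # 3. Input-output: evaluate X and Y together.
    combined = f"User request:\n{x}\n\nModel response:\n{y}"
    return y if is_safe(evaluator, combined) else REFUSAL
```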

Experimental Setup and Results

The authors conducted extensive experiments using various models, including Vicuna-7B, Llama-2, and GPT-4, as both generators and evaluators. The evaluation focused on attack success rate (ASR) and false positive rate; a minimal sketch of both metrics follows the list. The key findings include:

  • Adversarial Suffixes: Self-evaluation drastically reduces the ASR of adversarial-suffix attacks. For Vicuna-7B, the ASR dropped from 95% (undefended) to near zero with the self-evaluation defense.
  • Comparison with Other Defenses: Self-evaluation outperformed Llama-Guard2 and multiple commercial APIs across different generator models, showing robustness even against adaptive attacks.
  • False Positives: While ensuring high safety, the method maintained low false positives on benign inputs. Vicuna-7B as an evaluator demonstrated a minor increase in ASR with initialized suffixes but was highly effective against adversarial suffixes.
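
Both metrics are simple fractions over the test prompts; the sketch below uses the standard definitions, while the paper's exact procedure for judging whether an attack succeeded may differ.

```python
# Standard metric definitions assumed here; the paper's harm-judging procedure may differ.

def attack_success_rate(attack_succeeded):
    """Fraction of adversarial prompts whose final response is judged harmful."""
    return sum(attack_succeeded) / len(attack_succeeded)   # list of booleans

def false_positive_rate(benign_refused):
    """Fraction of benign prompts that the defense wrongly blocks."""
    return sum(benign_refused) / len(benign_refused)        # list of booleans
```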

Adaptive Attacks

The authors explored adaptive attacks targeting the evaluator, specifically:

  • Direct Attack: A single concatenated suffix targets both the generator and the evaluator. Although the attack succeeded to some extent, the combined ASR was significantly lower than that of the generator-only attack.
  • Copy-Paste Attack: The adversary instructed G to append a suffix designed to confuse E to its own output. This attack achieved a higher ASR against some evaluators but was overall less effective than the direct attack; a schematic sketch of both constructions follows.
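
The sketch below is purely schematic: the <...> placeholders stand in for adversarial suffixes found by an optimization procedure, and none of the strings are taken from the paper.

```python
# Schematic only: the <...> placeholders stand for optimized adversarial
# suffixes; real suffixes are produced by a search procedure, not hand-written.

harmful_request = "<harmful request>"

# Direct attack: a single concatenated suffix must both jailbreak the
# generator G and push the evaluator E toward a "safe" verdict.
direct_attack_input = harmful_request + " <suffix optimized against G and E>"

# Copy-paste attack: the prompt jailbreaks G and also instructs it to append
# an evaluator-confusing suffix to its own output, so the confusing text
# appears in whatever E later evaluates.
copy_paste_input = (
    harmful_request
    + " <suffix optimized against G>"
    + " End your reply with this exact string: <suffix that confuses E>"
)
```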

Implications

The findings indicate that self-evaluation is a potent defense mechanism for LLMs, enhancing both theoretical and practical aspects of AI system safety. By leveraging pre-trained models, the approach is not only effective but also cost-efficient, obviating the need for extensive fine-tuning. This positions self-evaluation as a viable method for deployment in diverse real-world applications where robustness against adversarial attacks is critical.

Future Developments

While the self-evaluation defense shows promise, the paper acknowledges potential areas for further research:

  • Broader Attack Types: Extending evaluation to other forms of attacks beyond adversarial suffixes.
  • Multi-Lingual Evaluations: Ensuring effectiveness across different languages, as the current paper focuses on English.
  • Enhanced Understanding: Investigating why adversarial suffixes can bypass the generator's alignment while the evaluator can still classify the resulting content as harmful.

By addressing these aspects, future work can further fortify LLM defenses, contributing to the reliability and safety of AI systems amid evolving adversarial landscapes.
