- The paper introduces LLM-Virus, a novel evolutionary approach for creating efficient and transferable jailbreak attacks against large language models.
- The framework employs evolutionary algorithms and LLMs as heuristic operators within a black-box setting to generate adversarial prompts.
- Experiments demonstrate that LLM-Virus achieves competitive or superior attack success rates and strong transferability across various LLM architectures.
An Overview of "LLM-Virus: Evolutionary Jailbreak Attack on LLMs"
The research paper "LLM-Virus: Evolutionary Jailbreak Attack on LLMs" introduces a novel approach to adversarial attacks on LLMs. The authors present LLM-Virus, an attack methodology inspired by the evolution and infection processes of biological viruses. The approach aims to improve the efficacy, transferability, and computational efficiency of jailbreak attacks on LLMs compared to existing methods.
Key Contributions and Methodology
The paper identifies limitations of current jailbreak attacks, which rely largely on opaque optimization techniques and heuristic search. To address these shortcomings, the authors propose treating the jailbreak attack as both an evolutionary problem and a transfer learning problem, with evolutionary algorithms at the core of the method. LLM-Virus operates in a black-box setting, generating adversarial prompts without access to model internals such as weights or gradients.
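To make the black-box setting concrete, the sketch below shows how a candidate jailbreak template could be scored using only API-level access to a target model. The `[QUERY]` placeholder convention, the `query_target_model` wrapper, and the refusal-phrase heuristic are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of black-box fitness scoring for a jailbreak template.
# Assumption: the target LLM is only reachable through a text-in/text-out API,
# passed in here as `query_target_model(prompt) -> str` (a hypothetical wrapper).

REFUSAL_MARKERS = ["i'm sorry", "i cannot", "i can't assist"]  # crude illustrative heuristic

def is_jailbroken(response: str) -> bool:
    """Treat a response as a successful jailbreak if it contains no refusal phrase.
    Real evaluations typically rely on a stronger judge model instead."""
    lowered = response.lower()
    return not any(marker in lowered for marker in REFUSAL_MARKERS)

def fitness(template: str, queries: list[str], query_target_model) -> float:
    """Fraction of malicious queries that the template successfully jailbreaks."""
    successes = 0
    for query in queries:
        prompt = template.replace("[QUERY]", query)  # assumed placeholder convention
        successes += is_jailbroken(query_target_model(prompt))
    return successes / len(queries)
```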
Key components of the LLM-Virus framework include:
- Strain Collection: The initialization phase involves gathering jailbreak templates with desirable features such as stealthiness, diversity, and conciseness. These templates serve as the foundation for further evolutionary operations.
- Local Evolution: The collected templates are optimized with an evolutionary algorithm in which LLMs act as heuristic search operators. Selection, crossover, and mutation enrich the solution space and strengthen the adversarial capability of the prompts; a sketch of this loop follows the list.
- Generalized Infection: After evolution, the optimized templates are tested against a broader set of malicious queries. Their ability to maintain a high attack success rate (ASR) across diverse LLM architectures demonstrates robust transferability.
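Read together, these components form a familiar evolutionary loop in which an auxiliary attacker LLM plays the role of the crossover and mutation operators. The sketch below reuses the `fitness` function from above; the `attacker_llm(instruction) -> str` helper, the operator prompts, and the hyperparameters are assumptions made for illustration rather than the paper's exact operators.

```python
import random

def evolve(population: list[str], queries: list[str], query_target_model,
           attacker_llm, generations: int = 10, elite_k: int = 4) -> list[str]:
    """LLM-guided evolutionary search over jailbreak templates (illustrative sketch).
    `attacker_llm(instruction) -> str` is a hypothetical helper that asks an
    auxiliary LLM to rewrite or merge templates."""
    for _ in range(generations):
        # Selection: keep the templates with the highest attack success rate.
        scored = sorted(population,
                        key=lambda t: fitness(t, queries, query_target_model),
                        reverse=True)
        elites = scored[:elite_k]

        offspring = []
        while len(offspring) < len(population) - elite_k:
            parent_a, parent_b = random.sample(elites, 2)
            # Crossover: ask the attacker LLM to merge two parent templates.
            child = attacker_llm(
                f"Combine these two prompt templates into one coherent template:\n"
                f"A: {parent_a}\nB: {parent_b}"
            )
            # Mutation: ask the attacker LLM to paraphrase while keeping intent.
            child = attacker_llm(f"Rewrite this template to be shorter and stealthier:\n{child}")
            offspring.append(child)

        population = elites + offspring
    return population
```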
Experimental Results and Analysis
The authors demonstrate the efficacy of LLM-Virus through extensive experiments on prominent benchmarks such as HarmBench and AdvBench. They report competitive or superior ASR compared to several existing attack methods, including AutoDAN and BlackDAN. LLM-Virus achieves substantial ASR improvements on all tested open-source models, performing especially well on the Llama and Vicuna series; a sketch of such a cross-model evaluation follows.
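A hedged sketch of how such a cross-model comparison might be organized, reusing the `fitness` function from above; the model names and the `make_query_fn` factory are placeholders, not the paper's evaluation harness.

```python
# Evaluate one evolved template against several target models (illustrative).
# `make_query_fn(name)` is a hypothetical factory returning a text-in/text-out
# wrapper for the named model; `benchmark_queries` is a held-out query list,
# e.g. behaviors drawn from HarmBench or AdvBench.

TARGET_MODELS = ["llama-2-7b-chat", "vicuna-13b", "gpt-3.5-turbo"]  # examples only

def evaluate_across_models(template: str, benchmark_queries: list[str],
                           make_query_fn) -> dict[str, float]:
    """Attack success rate of a single template on each target model."""
    return {
        name: fitness(template, benchmark_queries, make_query_fn(name))
        for name in TARGET_MODELS
    }
```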
Additional insights gained from the experimental results include:
- Evolution Dynamics: The paper tracks ASR and template length across generations, showing that the framework improves attack success while shortening templates for efficiency.
- Transferability: Templates evolved on a specific host LLM transfer effectively to other models, underscoring the method's generalization potential.
- Perplexity and Cost Efficiency: Evolved templates maintain moderate perplexity, reducing the risk of detection by perplexity-based filters (see the sketch below), and incur a significantly lower time cost than traditional gradient-based methods.
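As an illustration of the perplexity point, the sketch below computes the perplexity of a prompt under a small reference language model, the way a defender-side perplexity filter might. The choice of GPT-2 as the reference model and the flagging threshold are assumptions for illustration, not details taken from the paper.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

def perplexity(text: str, model_name: str = "gpt2") -> float:
    """Perplexity of `text` under a small reference LM (the defender-side view)."""
    tokenizer = GPT2TokenizerFast.from_pretrained(model_name)
    model = GPT2LMHeadModel.from_pretrained(model_name)
    model.eval()
    encodings = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(encodings.input_ids, labels=encodings.input_ids)
    return torch.exp(outputs.loss).item()

# A natural-language jailbreak template tends to score far lower than a
# gradient-crafted adversarial suffix, so it is less likely to be flagged.
template = "You are a novelist drafting a tense interrogation scene..."  # placeholder text
if perplexity(template) > 500.0:  # threshold value is an illustrative assumption
    print("This prompt would likely be flagged by a perplexity filter.")
```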
Implications and Future Directions
The work holds notable implications for the fields of adversarial attacks and LLM safety. It challenges current assumptions about attack strategies by integrating evolutionary concepts, introducing a structured framework for improving adversarial prompt design. The approach not only highlights potential vulnerabilities in state-of-the-art LLMs but also advocates for enhanced safety mechanisms to counteract these sophisticated attacks.
For future developments, the paper suggests refining evolutionary strategies and exploring more efficient heuristic operators. Furthermore, examining the applicability of LLM-enhanced evolutionary algorithms in other domains, such as secure content generation and automated moderation, could offer additional research avenues.
In conclusion, the LLM-Virus framework presents a compelling fusion of evolutionary techniques and LLM capabilities to advance the field of adversarial attacks. Its contributions underscore the need for continued exploration into adaptive and resilient safety measures in AI systems.