Efficient Adversarial Training in LLMs with Continuous Attacks (2405.15589v3)

Published 24 May 2024 in cs.LG and cs.CR

Abstract: LLMs are vulnerable to adversarial attacks that can bypass their safety guardrails. In many domains, adversarial training has proven to be one of the most promising methods to reliably improve robustness against such attacks. Yet, in the context of LLMs, current methods for adversarial training are hindered by the high computational costs required to perform discrete adversarial attacks at each training iteration. We address this problem by instead calculating adversarial attacks in the continuous embedding space of the LLM, which is orders of magnitudes more efficient. We propose a fast adversarial training algorithm (C-AdvUL) composed of two losses: the first makes the model robust on continuous embedding attacks computed on an adversarial behaviour dataset; the second ensures the usefulness of the final model by fine-tuning on utility data. Moreover, we introduce C-AdvIPO, an adversarial variant of IPO that does not require utility data for adversarially robust alignment. Our empirical evaluation on five models from different families (Gemma, Phi3, Mistral, Zephyr, Llama2) and at different scales (2B, 3.8B, 7B) shows that both algorithms substantially enhance LLM robustness against discrete attacks (GCG, AutoDAN, PAIR), while maintaining utility. Our results demonstrate that robustness to continuous perturbations can extrapolate to discrete threat models. Thereby, we present a path toward scalable adversarial training algorithms for robustly aligning LLMs.

Continuous Adversarial Training for Robust LLMs

The paper investigates continuous adversarial training (AT) techniques for improving the robustness of LLMs against adversarial attacks. Such attacks pose a significant challenge, undermining model integrity by bypassing safety guardrails, and motivate rigorous methodologies for achieving robust performance.

Introduction

The advent of LLMs has brought significant advances in natural language processing, yet it also exposes these models to adversarial attacks. Such attacks can effectively disable the protective mechanisms embedded within the models, as highlighted by Zou et al. (2023) and Andriushchenko et al. (2024). Traditional adversarial training, although effective in other domains, encounters limitations when applied to LLMs due to the high computational cost of computing discrete attacks at every training iteration. This paper introduces an efficient alternative that performs adversarial training in the continuous embedding space rather than with discrete attacks.

Methodology

The paper proposes two novel algorithms, Continuous-Adversarial UL (C-AdvUL) and Continuous-Adversarial IPO (C-AdvIPO), designed to enhance LLM robustness while maintaining utility. The core idea is to perform adversarial training in the continuous embedding space rather than the discrete token space, significantly reducing the computational burden.
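To make the core idea concrete, the following is a minimal sketch of a continuous embedding-space attack computed by a few steps of signed gradient descent toward a harmful target response, assuming a HuggingFace-style causal LM. The function name `embedding_attack` and the hyperparameters are illustrative assumptions, not the authors' exact implementation.

```python
import torch

def embedding_attack(model, input_ids, target_labels, eps=0.05, alpha=0.01, steps=10):
    """Minimal sketch of a continuous attack in the embedding space.

    The perturbation delta is optimized so that the model assigns higher
    probability to a harmful target response (analogous to GCG, but in
    continuous space), constrained to an L-inf ball of radius eps.
    Illustrative only: names and hyperparameters are not the paper's settings.
    """
    embeds = model.get_input_embeddings()(input_ids).detach()
    delta = torch.zeros_like(embeds, requires_grad=True)

    for _ in range(steps):
        out = model(inputs_embeds=embeds + delta, labels=target_labels)
        grad = torch.autograd.grad(out.loss, delta)[0]
        # Descend the target loss: make the harmful completion more likely.
        delta = (delta - alpha * grad.sign()).clamp(-eps, eps).detach()
        delta.requires_grad_(True)

    # In practice only the prompt (or an appended suffix) would be perturbed;
    # here the whole sequence is perturbed for brevity.
    return (embeds + delta).detach()
```

Because the perturbation lives in the continuous embedding space, each attack requires only a handful of gradient steps, which is what makes running the attack at every training iteration affordable.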

Continuous-Adversarial UL (C-AdvUL) - Combines two losses: one makes the model robust to continuous embedding attacks computed on a dataset of adversarial behaviours, and the other fine-tunes on utility data so that the model remains useful despite the increased robustness.

Continuous-Adversarial IPO (C-AdvIPO) - Adapts Identity Preference Optimization (IPO) for adversarially robust alignment without requiring auxiliary utility data. The preference objective is computed under continuous embedding perturbations, reducing the model's susceptibility to adversarial inputs.
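As a rough picture of how these objectives fit together, the sketch below combines a toward-refusal term and an unlikelihood-style away-from-harmful term, both evaluated under the continuous attack, with a standard fine-tuning term on utility data, in the spirit of C-AdvUL. The function and field names (`c_advul_step`, `refusal_labels`, `harmful_labels`) and the weighting `lam` are hypothetical; C-AdvIPO would instead apply an IPO preference loss to (safe, harmful) response pairs under the same perturbation, with no utility term.

```python
import torch
import torch.nn.functional as F

def c_advul_step(model, adv_batch, utility_batch, lam=0.5):
    """Sketch of one C-AdvUL-style training step (illustrative only).

    adv_batch: a harmful prompt paired with a safe refusal and a harmful
    completion; utility_batch: ordinary instruction-tuning data.
    All tensors are assumed to share one sequence length (prompt + response),
    with label tensors holding plain token ids, to keep the sketch short.
    """
    # 1) Inner step: continuous attack targeting the harmful completion
    #    (embedding_attack as sketched above).
    adv_embeds = embedding_attack(model, adv_batch["input_ids"],
                                  adv_batch["harmful_labels"])

    # 2) Outer step, evaluated under the attack:
    #    (a) keep the safe refusal likely,
    safe_loss = model(inputs_embeds=adv_embeds,
                      labels=adv_batch["refusal_labels"]).loss

    #    (b) push probability mass away from the harmful completion
    #        (unlikelihood-style term: minimize -log(1 - p(harmful token))).
    logits = model(inputs_embeds=adv_embeds).logits
    logp = torch.gather(
        F.log_softmax(logits[:, :-1], dim=-1), 2,
        adv_batch["harmful_labels"][:, 1:].unsqueeze(-1)).squeeze(-1)
    unlikelihood = -torch.log1p(-logp.exp().clamp(max=1 - 1e-6)).mean()

    # 3) Utility term on clean data so the model stays helpful.
    utility_loss = model(input_ids=utility_batch["input_ids"],
                         labels=utility_batch["labels"]).loss

    return safe_loss + lam * unlikelihood + utility_loss
```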

Key Results

Empirical evaluations across five models from different families (Gemma, Phi3, Mistral, Zephyr, Llama2) and at different scales (2B, 3.8B, 7B) demonstrate substantial improvements in robustness against well-known discrete attacks such as GCG, AutoDAN, and PAIR. Notably, C-AdvUL and C-AdvIPO outperform the prior adversarial training baseline R2D2: attack success rates drop to zero in several settings, while the computational requirements of training are reduced by a factor of roughly 300 compared to discrete attack-based adversarial training.

Discussion

The findings suggest that robustness achieved through continuous adversarial attacks extrapolates to discrete threat models, marking a significant step toward scalable adversarial training for LLMs. Moreover, the training objectives prevent the model from overfitting to the safety objective, maintaining a favourable balance between robustness and utility.

The implications are far-reaching. Practically, the reduced computational requirements enable more extensive adversarial training regimes, improving model deployment in real-world scenarios where safety and robustness are critical. Theoretically, the work provides a foundation for exploring how continuous perturbations can be integrated into broader machine learning frameworks, potentially influencing future research directions in adversarial machine learning.

Future Directions

Future developments could explore untapped areas such as:

  1. Hybrid Models: Combining continuous and discrete AT techniques to fine-tune robustness further.
  2. Scalability: Scaling the proposed methods to larger, more complex models beyond the currently tested 7B parameters.
  3. Evaluation Protocols: Enhancing utility evaluation protocols to include chat templates, ensuring realistic assessments of model performance.

Conclusion

The paper significantly contributes to the field by demonstrating that continuous AT can achieve robust and computationally efficient alignment of LLMs with safety objectives. By validating these methods across multiple models and attack scenarios, the research offers a solid framework for scalable adversarial training, setting a new standard for ensuring the robustness of LLMs in practical applications.

References

The summary cites Zou et al. (2023), Andriushchenko et al. (2024), and Mazeika et al. (2024), who respectively introduced the GCG attack, adaptive jailbreak attacks, and the HarmBench benchmark with the R2D2 adversarial training baseline; these works form the contextual backdrop for the paper. Full bibliographic details are given in the reference list below.

References (46)
  1. Universal and Transferable Adversarial Attacks on Aligned Language Models. arXiv:2307.15043, 2023.
  2. Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks. arXiv:2404.02151, 2024.
  3. Explaining and Harnessing Adversarial Examples. In International Conference on Learning Representations (ICLR), 2015.
  4. Towards Deep Learning Models Resistant to Adversarial Attacks. In International Conference on Learning Representations (ICLR), 2018.
  5. Baseline Defenses for Adversarial Attacks Against Aligned Language Models. arXiv:2309.00614, 2023.
  6. Harmbench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal. arXiv:2402.04249, 2024.
  7. Adversarial Attacks and Defenses in Large Language Models: Old and New Threats. arXiv:2310.19737, 2023.
  8. Soft Prompt Threats: Attacking Safety Alignment and Unlearning in Open-Source LLMs through the Embedding Space. arXiv:2402.09063, 2024.
  9. SMART: Robust and Efficient Fine-Tuning for Pre-Trained Natural Language Models through Principled Regularized Optimization. Association for Computational Linguistics (ACL), 2020.
  10. FreeLB: Enhanced Adversarial Training for Natural Language Understanding. International Conference on Learning Representations (ICLR), 2020.
  11. Generative Adversarial Nets. In Advances in Neural Information Processing Systems (NeurIPS), 2014.
  12. Identifying Untrustworthy Predictions in Neural Networks by Geometric Gradient Analysis. In Uncertainty in Artificial Intelligence (UAI), 2021.
  13. Raising the Bar for Certified Adversarial Robustness with Diffusion Models. arXiv:2305.10388, 2023.
  14. Jailbreaking Black Box Large Language Models in Twenty Queries. arXiv:2310.08419, 2023.
  15. AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models. International Conference on Learning Representations (ICLR), 2024.
  16. Jailbreaker: Automated Jailbreak Across Multiple Large Language Model Chatbots. arXiv:2307.08715, 2023.
  17. AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs. arXiv:2404.16873, 2024.
  18. In-Context Learning Can Re-learn Forbidden Tasks. arXiv:2402.05723, 2024.
  19. Catastrophic Jailbreak of Open-Source LLMs via Exploiting Generation. In International Conference on Learning Representations (ICLR), 2024.
  20. Attacking Large Language Models with Projected Gradient Descent. arXiv:2402.09154, 2024.
  21. Stanislav Fort. Scaling Laws for Adversarial Attacks on Language Model Activations. arXiv:2312.02780, 2023.
  22. Adversarial Training for Large Neural Language Models. arXiv:2004.08994, 2020.
  23. DeBERTa: Decoding-Enhanced BERT with Disentangled Attention. International Conference on Learning Representations (ICLR), 2021.
  24. Token-Aware Virtual Adversarial Training in Natural Language Understanding. In AAAI, 2021.
  25. Improved Text Classification via Contrastive Adversarial Training. In AAAI, 2022.
  26. SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks. arXiv:2310.03684, 2023.
  27. Defending Against Unforeseen Failure Modes with Latent Adversarial Training. arXiv:2403.05030, 2024.
  28. Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts. arXiv:2402.16822, 2024.
  29. Neural Text Generation with Unlikelihood Training. In International Conference on Learning Representations (ICLR), 2020.
  30. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. Advances in Neural Information Processing Systems (NeurIPS), 2024.
  31. A General Theoretical Paradigm to Understand Learning from Human Preferences. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2024.
  32. Enhancing Chat Language Models by Scaling High-Quality Instructional Conversations. In Empirical Methods in Natural Language Processing (EMNLP), 2023.
  33. Zephyr: Direct Distillation of LM Alignment. arXiv:2310.16944, 2023a.
  34. The Alignment Handbook. https://github.com/huggingface/alignment-handbook, 2023b.
  35. Measuring Massive Multitask Language Understanding. In International Conference on Learning Representations (ICLR), 2021.
  36. François Chollet. On the Measure of Intelligence. arXiv:1911.01547, 2019.
  37. Judging LLM-As-A-Judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems (NeurIPS), 2024.
  38. Gemma: Open Models Based on Gemini Research and Technology. arXiv:2403.08295, 2024.
  39. Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. arXiv:2404.14219, 2024.
  40. Mistral 7B. arXiv:2310.06825, 2023.
  41. LoRA: Low-Rank Adaptation of Large Language Models. In International Conference on Learning Representations (ICLR), 2022.
  42. A Framework for Few-Shot Language Model Evaluation, 2023.
  43. Why Should Adversarial Perturbations be Imperceptible? Rethink the Research Paradigm in Adversarial NLP. Empirical Methods in Natural Language Processing (EMNLP), 2022.
  44. Theoretically Principled Trade-Off between Robustness and Accuracy. In International conference on machine learning (ICML), 2019.
  45. Decoupled Weight Decay Regularization. In International Conference on Learning Representations (ICLR), 2019.
  46. Fast is Better than Free: Revisiting Adversarial Training. In International Conference on Learning Representations (ICLR), 2020.
Authors (5)
  1. Sophie Xhonneux (8 papers)
  2. Alessandro Sordoni (53 papers)
  3. Stephan Günnemann (169 papers)
  4. Gauthier Gidel (76 papers)
  5. Leo Schwinn (36 papers)
Citations (18)