- The paper introduces a backtracking method using a RESET token to discard unsafe outputs and regenerate safe alternatives.
- It combines supervised fine-tuning and direct preference optimization to detect and recover from unsafe generations.
- Empirical results show a 4x reduction in safety violations and enhanced adversarial robustness with minimal impact on speed.
Backtracking Improves Generation Safety
The paper "Backtracking Improves Generation Safety," authored by Yiming Zhang et al., presents a novel technique to enhance the safety of text generation in LLMs. The proposed method, known as backtracking, allows LLMs to "undo" unsafe text generations and produce safer alternatives, addressing a critical limitation in current safety alignment strategies.
Key Contributions
The backtracking method introduces a special token, referred to as RESET, which, when generated, signals the model to discard prior unsafe tokens and regenerate a safe response from scratch. The approach is designed to complement existing techniques like supervised fine-tuning (SFT) and direct preference optimization (DPO), ensuring both helpfulness and harmlessness of the models.
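As a concrete illustration of this setup, the sketch below registers a RESET special token with a Hugging Face tokenizer and model. The token string `[RESET]`, the model checkpoint, and the variable names are placeholders for illustration, not the paper's exact implementation.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Illustrative setup: register a RESET special token so the model can learn to emit it.
# "[RESET]" and the checkpoint name are placeholders, not the paper's exact choices.
model_name = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

num_added = tokenizer.add_special_tokens({"additional_special_tokens": ["[RESET]"]})
if num_added > 0:
    # Grow the embedding matrix so the new token gets a trainable embedding.
    model.resize_token_embeddings(len(tokenizer))

RESET_ID = tokenizer.convert_tokens_to_ids("[RESET]")
```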
Methodology
Supervised Fine-Tuning (SFT)
In the first stage, the pre-trained model is fine-tuned on a safety-tuning dataset in which each prompt $x$ is paired with a safe response $y^{+}$ and an unsafe response $y^{-}$. The SFT model is trained to emit the RESET token upon detecting an unsafe partial generation and then to produce the safe alternative. The training objective is:
$$
\mathcal{L}(\theta) = -\,\mathbb{E}_{(x,\,y^{+},\,y^{-})}\!\left[\log p_{\theta}\!\left(\texttt{RESET} \oplus y^{+} \mid x \oplus \operatorname{prefix}(y^{-})\right)\right] \;-\; \mathbb{E}_{(x,\,y^{+})}\!\left[\log p_{\theta}\!\left(y^{+} \mid x\right)\right]
$$
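As a rough sketch of what the two terms above imply for data construction, the hypothetical helper below builds one backtracking example (unsafe prefix followed by RESET and the safe response) and one plain safe example per triple $(x, y^{+}, y^{-})$. The random character-level truncation is an illustrative assumption; the paper operates on token sequences.

```python
import random

def build_sft_examples(x, y_plus, y_minus, reset_token="[RESET]"):
    """Build (input_text, target_text) pairs mirroring the two loss terms above.

    x       -- prompt
    y_plus  -- safe response
    y_minus -- unsafe response
    Only the target side would contribute to the cross-entropy loss.
    """
    # Term 1: prompt plus a partial unsafe generation -> RESET followed by the safe response.
    cut = random.randint(1, max(1, len(y_minus) - 1))  # illustrative character-level cut
    backtrack_example = (x + y_minus[:cut], reset_token + y_plus)

    # Term 2: standard safety SFT -- prompt alone -> safe response.
    plain_example = (x, y_plus)

    return [backtrack_example, plain_example]
```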
Direct Preference Optimization (DPO)
Building on the SFT model, DPO further optimizes the model's behavior using preference pairs that reward backtracking on unsafe generations. For example, in the pair $\big(\operatorname{prefix}(y^{-}) \oplus \texttt{RESET} \oplus y^{+},\; y^{-}\big)$, the preferred continuation starts from an unsafe prefix, backtracks with RESET, and ends with the safe response, while the dispreferred continuation is the full unsafe response; training on such pairs teaches the model to backtrack when its partial output turns unsafe.
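Under the assumption of a TRL-style `{prompt, chosen, rejected}` dataset format, one such backtracking pair could be assembled as follows; the helper name and the midpoint truncation are illustrative, not taken from the paper.

```python
def build_dpo_pair(x, y_plus, y_minus, reset_token="[RESET]", cut=None):
    """One backtracking preference pair as a TRL-style {prompt, chosen, rejected} dict.

    chosen:   unsafe prefix, then RESET, then the safe response (prefix(y-) + RESET + y+)
    rejected: the full unsafe response (y-)
    """
    if cut is None:
        cut = max(1, len(y_minus) // 2)  # illustrative truncation point
    return {
        "prompt": x,
        "chosen": y_minus[:cut] + reset_token + y_plus,
        "rejected": y_minus,
    }
```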
Inference
During inference, if the model generates the RESET token, all tokens generated before it are discarded, and only the tokens after the RESET token are considered as the final output.
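A minimal sketch of this post-processing step, assuming the RESET token decodes to the literal string `[RESET]`:

```python
def extract_final_output(generated_text: str, reset_token: str = "[RESET]") -> str:
    """Return only the text after the last RESET token; if none is present, return the text unchanged."""
    if reset_token in generated_text:
        return generated_text.rsplit(reset_token, 1)[1]
    return generated_text

# Example (hypothetical output):
# raw = "Sure, here is how to ... [RESET]I can't help with that, but here is some safety guidance."
# extract_final_output(raw)  ->  "I can't help with that, but here is some safety guidance."
```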
Results and Analysis
The evaluation of backtracking models showed substantial improvements in safety without compromising helpfulness.
- Safety Improvement: The backtracking Llama-3-8B model demonstrated a 4x reduction in unsafe responses compared to the baseline, with safety violation rates dropping from 6.1% to 1.5%.
- Efficiency Trade-off: Backtracking adds latency (roughly one additional second on average) because discarded tokens still have to be generated; adjusting a logit bias on the RESET token mitigates this cost, maintaining safety with minimal impact on generation speed (see the sketch after this list).
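One way to apply such a bias, sketched here with a custom Hugging Face `LogitsProcessor` rather than the authors' implementation, is to add a constant to the RESET token's logit at every decoding step; the bias value and its sign are tunable knobs, not values from the paper.

```python
import torch
from transformers import LogitsProcessor, LogitsProcessorList

class ResetLogitBias(LogitsProcessor):
    """Add a constant bias to the RESET token's logit at every decoding step."""

    def __init__(self, reset_token_id: int, bias: float):
        self.reset_token_id = reset_token_id
        self.bias = bias

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        scores[:, self.reset_token_id] += self.bias
        return scores

# Usage (assumes `model`, `tokenizer`, and RESET_ID from the earlier sketch):
# inputs = tokenizer("...", return_tensors="pt")
# processors = LogitsProcessorList([ResetLogitBias(RESET_ID, bias=1.0)])
# outputs = model.generate(**inputs, logits_processor=processors, max_new_tokens=256)
```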
Moreover, backtracking models exhibited strong resistance against adversarial attacks:
- Adversarial Robustness: Against various attack methods, including prefilling, GCG, AutoDAN, and an adaptive attack specifically designed to counter backtracking, the models showcased improved robustness. For example, the attack success rate for the prefilling attack dropped dramatically from 50.4% to 11.5% in the backtracking Gemma-2-2B model.
Implications and Future Directions
From a theoretical perspective, the implementation of backtracking signifies a shift from solely prevention-based safety to incorporating recovery mechanisms within LLMs. Practically, this method can be applied to a variety of content moderation scenarios where ensuring safe text output is critical.
The paper opens avenues for further research:
- Beyond Safety: Exploring backtracking's applicability to countering hallucinations and enhancing reasoning in LLMs.
- Robustness Enhancement: Integrating adversarial training and steering internal activations to further solidify the robustness of backtracking against sophisticated attacks.
- Efficiency Optimization: Investigating strategies to refine the efficiency-safety balance, ensuring high performance in production environments.
In conclusion, while the backtracking method significantly enhances the safety of LLM-generated content, additional research is necessary to tackle inherent trade-offs and explore broader applications, ultimately aiming for more resilient, reliable AI systems.