- The paper introduces Logicbreaks, a framework using propositional Horn logic to model rule-based inference in LLMs and analyze its subversion.
- Theoretical attacks designed within the framework demonstrate how adversarial prompts can subvert rule-following by disrupting fact recall or rule application.
- The findings emphasize the critical need for robust defenses in LLMs to prevent rule-based inference subversion and ensure reliable and safe outputs.
A Framework for Understanding Rule-Based Inference Subversion in LLMs
The paper "Logicbreaks: A Framework for Understanding Subversion of Rule-based Inference" by Xue et al. addresses the critical issue of ensuring that LLMs adhere to rule-based prompts, which is pivotal for the reliability and safety of their outputs. By conceptualizing rule-following as an inference task in propositional Horn logic, the authors provide a structured methodology for analyzing how LLMs can be subverted from their intended outputs through crafted adversarial prompts.
Modeling Rule-Following in Horn Logic
The authors propose modeling rule-based inference using propositional Horn logic, which allows the representation of rules in the form of implications such as "if P and Q, then R". Within this framework, rule-following corresponds to executing forward chaining until no new information can be deduced from the rule set. This formulation aligns with classical logic systems used in AI and enables a structured analysis of how rule adherence can be sabotaged.
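As a minimal illustration of this formulation (not the paper's implementation), Horn rules and forward chaining can be expressed in a few lines of Python; the proposition names and rule set below are hypothetical.

```python
# Minimal forward-chaining sketch for propositional Horn rules.
# Each rule is (antecedents, consequent): "if all antecedents hold, then the consequent holds".
# The proposition names and example rules are illustrative, not taken from the paper.

def forward_chain(rules, facts):
    """Apply rules until no new facts can be derived (a fixed point)."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for antecedents, consequent in rules:
            if antecedents <= known and consequent not in known:
                known.add(consequent)  # the rule fires: add its consequent
                changed = True
    return known

rules = [({"P", "Q"}, "R"),  # if P and Q, then R
         ({"R"}, "S")]       # if R, then S
print(forward_chain(rules, {"P", "Q"}))  # {'P', 'Q', 'R', 'S'}
```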
Three core properties—monotonicity, maximality, and soundness—are introduced to characterize rule-following in this context:
- Monotonicity: Ensures that the set of known facts does not shrink across inference steps.
- Maximality: Guarantees that every applicable rule is applied, promoting logical completeness.
- Soundness: Ensures that a proposition is derived only if it was already known or is the consequent of an applied rule.
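To make these properties concrete, they can be phrased as checks over a trace of inferred fact sets. The helpers below are a hypothetical sketch rather than the paper's formal definitions, and reuse the rule encoding from the `forward_chain` example above.

```python
# Hypothetical property checks over a trace of fact sets,
# where trace[t] is the set of facts known after t inference steps.

def is_monotone(trace):
    # The set of known facts never shrinks from one step to the next.
    return all(trace[t] <= trace[t + 1] for t in range(len(trace) - 1))

def is_maximal(trace, rules):
    # Every rule whose antecedents hold at step t has fired by step t + 1.
    return all(consequent in trace[t + 1]
               for t in range(len(trace) - 1)
               for antecedents, consequent in rules
               if antecedents <= trace[t])

def is_sound(trace, rules):
    # Every newly derived fact is the consequent of some rule that applied.
    return all(any(antecedents <= trace[t] and consequent == fact
                   for antecedents, consequent in rules)
               for t in range(len(trace) - 1)
               for fact in trace[t + 1] - trace[t])
```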
A key contribution of this paper is demonstrating that a simplified transformer architecture, with a single layer and a single attention head, can faithfully simulate the rule-application process of propositional Horn logic. This minimal construction forms the basis for understanding how transformers can, in principle, adhere to rule-based reasoning.
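The paper's construction specifies exact transformer weights; the numpy sketch below only conveys the flavor of one such encoding and is not the authors' model. It assumes propositions are coordinates of a binary state vector and each rule is a pair of antecedent and consequent masks, so that one thresholded linear step fires every applicable rule at once.

```python
import numpy as np

# Flavor of a rule-applying layer (not the paper's exact construction):
# propositions are coordinates of a binary vector, each rule i has an
# antecedent mask A[i] and a consequent mask B[i], and one step fires
# every rule whose antecedents are all present in the current state.

props = ["P", "Q", "R", "S"]            # illustrative propositions
A = np.array([[1, 1, 0, 0],             # rule 1: P and Q -> R
              [0, 0, 1, 0]])            # rule 2: R -> S
B = np.array([[0, 0, 1, 0],
              [0, 0, 0, 1]])

def step(state):
    # A rule fires iff it "sees" all of its antecedents in the state:
    # A @ state counts satisfied antecedents; compare with each rule's total.
    fired = (A @ state >= A.sum(axis=1)).astype(int)
    # Monotone update: keep old facts, add consequents of fired rules.
    return np.clip(state + fired @ B, 0, 1)

s = np.array([1, 1, 0, 0])              # start with {P, Q}
s = step(step(s))                       # two steps reach {P, Q, R, S}
print(dict(zip(props, s.tolist())))
```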
Through this theoretical lens, various modes of attack are constructed to illustrate how adversarial prompts can exploit the rule-following mechanism:
- Fact Amnesia Attack: Induces the model to forget facts it has already derived.
- Rule Suppression Attack: Prevents specific rules from being applied.
- State Coercion Attack: Forces the model's inferred state toward an arbitrary, attacker-chosen configuration.
Each attack exploits the transformer's mechanism through carefully crafted inputs that disrupt how facts are recalled or how rules are matched, causing the model to deviate from sound logical inference.
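To make the attack goals concrete at the logic level, the hypothetical traces below show the kind of deviation each attack aims to induce, reusing the property checks sketched earlier; the adversarial prompts that actually induce these deviations are found by optimizing against the model and are not shown here.

```python
# What each attack aims to induce, expressed over logic-level traces.
# The traces are illustrative; the adversarial inputs themselves are omitted.

rules = [({"P", "Q"}, "R"), ({"R"}, "S")]
ideal = [{"P", "Q"}, {"P", "Q", "R"}, {"P", "Q", "R", "S"}]

# Fact amnesia: a previously known fact ("Q") silently disappears.
amnesia = [{"P", "Q"}, {"P", "R"}, {"P", "R", "S"}]
print(is_monotone(amnesia))             # False

# Rule suppression: "R -> S" never fires even though R is known.
suppressed = [{"P", "Q"}, {"P", "Q", "R"}, {"P", "Q", "R"}]
print(is_maximal(suppressed, rules))    # False

# State coercion: the state jumps to something no rule justifies.
coerced = [{"P", "Q"}, {"P", "Q", "S"}]
print(is_sound(coerced, rules))         # False
```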
Transferability and Experiments
Empirical investigations demonstrate that the theoretical attacks transfer to more complex, data-driven models. GPT-2 models fine-tuned on structured, synthetically generated inference data serve as empirical proxies for the theoretical construction. Notably, attacks designed in the theoretical setting remain effective against these learned models, highlighting the robustness of the attack paradigms.
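As a rough illustration of how such structured training data might look (the exact prompt and answer format used in the paper is not reproduced here), a Horn-logic instance could be serialized into text, with the `forward_chain` sketch above supplying the target facts.

```python
# Hypothetical serialization of a Horn-logic instance into a training example;
# the prompt wording and answer format are assumptions, not the paper's format.

def serialize(rules, facts):
    rule_text = " ".join(f"If {' and '.join(sorted(a))} then {c}." for a, c in rules)
    fact_text = " ".join(f"{f} is true." for f in sorted(facts))
    return f"Rules: {rule_text} Facts: {fact_text} What follows?"

rules = [({"P", "Q"}, "R"), ({"R"}, "S")]
facts = {"P", "Q"}
prompt = serialize(rules, facts)
answer = ", ".join(sorted(forward_chain(rules, facts)))
print(prompt)
print("Answer:", answer)
```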
Furthermore, the paper examines real-world models such as Llama-2, which exhibit similar vulnerabilities, validating the theoretical insights. Adversarial suffixes found with jailbreak techniques such as GCG induce attention patterns akin to those predicted by the theory.
Implications and Future Prospects
The insights from this framework emphasize the necessity of robust defenses in LLM deployments, ensuring these systems meet safety and factual-accuracy requirements, especially in settings where incorrect outputs could lead to significant consequences such as legal liability or misinformation.
This paper highlights the ongoing need for research into more resilient LLMs, focusing on improving their understanding of and adherence to rule-based instructions. It also opens avenues for methods that routinely verify the logical soundness of model outputs.
Conclusion
Xue et al.'s work provides a detailed look at the dynamics of rule-based inference in LLMs, offering both theoretical insights and practical implications. Given the increasing reliance on LLMs across domains, understanding and mitigating potential adversarial subversion is essential for advancing the field responsibly. Future research could investigate more expressive logic systems and develop advanced safeguards to further bridge the gap between theoretical understanding and practical deployment of LLMs.