DRAGON: Reasoning-Based Unlearning in LLMs
- DRAGON is a systematic, reasoning-based framework designed for training-free unlearning in large language models.
- It combines a negative detection module with a Chain-of-Thought guard to enforce privacy and policy compliance during inference.
- The framework demonstrates high detector recall, stable utility preservation, and scalability across diverse unlearning scenarios.
Detect-Reasoning Augmented GeneratiON (DRAGON) refers to a systematic, reasoning-based framework for robust, training-free unlearning in LLMs. Unlearning aims to ensure that a deployed LLM behaves as if it had never seen a specified set of “forget” data (e.g., private records or harmful concepts), ideally without degrading the model’s overall utility and without access to retained training data. In contrast to traditional fine-tuning-based methods, DRAGON leverages in-context intervention at inference time, utilizing the inherent instruction-following capabilities of LLMs to provide scalable policy enforcement in privacy-sensitive and regulated domains (Wang et al., 8 Nov 2025).
1. Motivation and Conceptual Foundations
LLMs trained on vast corpora can inadvertently memorize private user data or encode hazardous knowledge. The unlearning problem asks: given a collection of “forget” prompts or concepts, how can one ensure the LLM does not reproduce or infer information related to them, while leaving all other behaviors unaffected? Prior approaches predominantly rely on training-time interventions (e.g., negative loss fine-tuning, parameter reweighting), which require access to both forget and retain data, entail considerable computation, and frequently introduce utility degradation. Training-free alternatives, such as prompt filtering and embedding defenses, face challenges in robustness and maintainability, particularly for continual or on-demand unlearning scenarios.
The central insight of DRAGON is to augment LLM inference with a detection and reasoning pipeline, avoiding any modification to the LLM’s parameters. Given an “unlearn store” of forget-worthy prompts and concepts (possibly only accessible as paraphrases without retain data), DRAGON detects sensitive queries and inserts an in-context Chain-of-Thought (CoT) reasoning guard, directing the LLM toward policy-compliant refusal or redaction.
2. DRAGON Framework Architecture
The DRAGON framework consists of two cascaded modules: a negative detection module and a chain-of-thought guard.
2.1 Negative Detection Module
The negative detector receives each inference-time prompt x and computes a confidence score s(x, D_u), where D_u denotes the current unlearn store:
- Sample-unlearning (e.g., privacy): the score combines embedding similarity between the prompt and the stored forget examples with an exact-match flag (see the sketch after this list).
- Concept-unlearning (e.g., hazardous concepts): the score is produced by a fine-tuned scorer (Llama-3.1-7B); BERTScore and ROUGE-L validate semantic similarity to forget prompts.
A prompt whose score exceeds the threshold τ is routed for guarded intervention.
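A minimal sketch of the sample-unlearning scorer follows. It assumes a generic sentence-embedding function, a list-of-dicts layout for the unlearn store, and a max-style combination of similarity and exact match; these choices, and the helper names `sample_unlearning_score` and `cosine`, are illustrative rather than the paper's exact formulation.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def sample_unlearning_score(prompt: str, prompt_emb: np.ndarray,
                            unlearn_store: list[dict]) -> float:
    """Score a prompt against the unlearn store D_u (illustrative).

    Each store entry holds a forget prompt ("text") and its embedding
    ("embedding"). The score takes the best match over the store, combining
    embedding similarity with an exact-match flag; this max-combination and
    the store layout are assumptions, not the paper's exact formula.
    """
    best = 0.0
    for entry in unlearn_store:
        sim = cosine(prompt_emb, entry["embedding"])                      # semantic similarity
        exact = 1.0 if prompt.strip() == entry["text"].strip() else 0.0   # exact-match flag
        best = max(best, sim, exact)
    return best

# Prompts scoring above the threshold τ are routed to the CoT guard (Section 2.3).
```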
2.2 Chain-of-Thought Guard
A lightweight CoT guard model (Llama-3.1-8B-Instruct), fine-tuned in a supervised manner on paired (question, CoT instruction) examples, generates contextually grounded reasoning instructions. The original prompt is recast into a guarded prompt:
```
**Context:** <policy or synthetic record>
**Question:** {x}
**Final instruction:** Let's think step by step. <CoT instruction generated by guard model>
```
The CoT-specific instruction prompts step-by-step reasoning about whether to refuse or redact per the relevant policy (e.g., GDPR/CCPA compliance), guiding the downstream LLM to maintain compliance.
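A small sketch of how the guarded prompt can be assembled from the retrieved policy, the original question, and the guard model's CoT instruction; it simply fills the template above, and the helper name mirrors `ComposePrompt` in the routing pseudocode below.

```python
def ComposePrompt(policy: str, question: str, cot_instruction: str) -> str:
    """Build the guarded prompt from the template shown above.

    policy          -- retrieved policy text or synthetic record (Context)
    question        -- the original user prompt x
    cot_instruction -- instruction generated by the CoT guard model
    """
    return (
        f"**Context:** {policy}\n"
        f"**Question:** {question}\n"
        f"**Final instruction:** Let's think step by step. {cot_instruction}"
    )
```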
2.3 Routing and Integration
The procedure is formalized as follows:
```python
def DRAGON_Inference(x, D_u, τ):
    # Score the incoming prompt against the current unlearn store D_u.
    score = Detector(x, D_u)
    if score > τ:
        # Sensitive prompt: retrieve the governing policy, generate a CoT
        # instruction, and answer via the guarded prompt.
        policy = RetrievePolicy(D_u, x)
        cot_inst = GuardModel.generate_cot(x, policy)
        guarded_prompt = ComposePrompt(policy, x, cot_inst)
        return BaseLLM.generate(guarded_prompt)
    else:
        # Non-sensitive prompt: pass through to the base LLM unchanged.
        return BaseLLM.generate(x)
```
The modular design enables integration into existing LLM inference pipelines, functioning as a pre- and post-processing layer.
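An illustrative call site, assuming the components named in the pseudocode are in scope; the prompt, the unlearn store `D_u`, and the threshold value are hypothetical:

```python
# Hypothetical integration: DRAGON_Inference wraps the base model call, so
# existing serving code only swaps one function. Prompt and threshold are
# illustrative; D_u is the current unlearn store.
user_prompt = "Summarize the medical history of patient John Doe."
response = DRAGON_Inference(user_prompt, D_u, 0.8)
```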
3. Unlearning Evaluation Metrics
To rigorously assess unlearning quality and stability (including continual unlearning), DRAGON introduces the following metrics:
- Refusal Quality (RQ): Aggregates (1) template similarity (cosine similarity to refusal templates), (2) refusal rate (binary classification), and (3) a normalized gibberish score; higher RQ reflects higher-quality, policy-compliant refusals (a sketch of this aggregation follows the list).
- Dynamic Deviation Score (DDS): Measures cumulative and incremental drift after sequential unlearning requests; lower values are preferable.
- Dynamic Utility Score (DUS): Assesses the stability of retained, non-sensitive utility across unlearning steps; larger DUS reflects stronger utility preservation.
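As an illustration of the RQ aggregation, the sketch below combines the three components named above; the unweighted sum and the helper's signature are assumptions rather than the paper's exact definition.

```python
def refusal_quality(template_sim: float, refused: bool, gibberish: float) -> float:
    """Illustrative aggregation of Refusal Quality (RQ).

    template_sim -- cosine similarity of the response to refusal templates, in [0, 1]
    refused      -- output of a binary refusal classifier
    gibberish    -- normalized gibberish score in [0, 1], higher = less fluent

    Sums template similarity, the refusal flag, and fluency (1 - gibberish);
    the unweighted sum is an assumption, not the paper's exact definition.
    """
    return template_sim + float(refused) + (1.0 - gibberish)
```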
4. Experimental Evaluation
DRAGON is evaluated across diverse unlearning scenarios:
| Task | Description | Metrics |
|---|---|---|
| WMDP | Hazardous knowledge unlearning | ProbAcc, RQ |
| TOFU | Privacy record unlearning (fictitious data) | Deviation Score, Model Utility |
| MUSE | Copyrighted content unlearning | VerbMem, KnowMem |
Benchmark LLMs include Zephyr-7B, Llama-3.1-8B-Instruct, Mixtral-8×7B, GPT-4o, and Llama-4. Baselines range from fine-tuning-based approaches (GA, KL, GD, PO, DPO, NPO-RT, FLAT, RMU) to prompt-based methods (Filter-Prompting, ICUL+).
Quantitative results demonstrate:
- WMDP: ProbAcc reduced to near chance, with RQ up to 1.296 (Mixtral-8×7B); base task accuracy (MMLU) is unaffected.
- TOFU: Lowest Deviation Score (21.4 with Llama2-7B-Chat on the 1% forget split) versus the best baseline (37.9); retained Model Utility (0.6337) matches that of the original model; KFR = 0.98, KRR = 0.88.
- MUSE: On both news and books, DRAGON is the only method to meet all MUSE requirements and outperforms others on all verbatim and knowledge memory metrics.
- Scaling: Larger and “Instruct” LLMs yield superior results. The pipeline scales from 1.5B to 70B-parameter LLMs.
Ablation and qualitative analysis reveal: (1) injected CoT guard reliably elicits policy-compliant refusals, (2) detector recall above 95% for diverse paraphrase sets, (3) continual unlearning remains stable with DDS = 0.2494 and DUS = 1.0.
5. Scalability, Deployment, and Limitations
5.1 Scalability
DRAGON’s detector imposes minimal computational overhead (5–10 ms per query). The guard model is trained once (30–50 minutes on 2×A100) and is usable across base LLM variants, including black-box APIs and instruction-tuned models, without parameter updates or retraining per unlearn request. New forget prompts or concepts can be incorporated rapidly by updating the unlearn store D_u with paraphrases and policies, as sketched below.
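A minimal sketch of such an update, assuming the list-of-dicts store layout used in the Section 2.1 example; `add_unlearn_request` and the `embed` callable are illustrative names, not the paper's API.

```python
def add_unlearn_request(unlearn_store: list[dict], forget_text: str,
                        paraphrases: list[str], policy: str, embed) -> None:
    """Register a new forget prompt or concept without touching model weights.

    `embed` is any sentence-embedding callable; the list-of-dicts layout
    mirrors the illustrative detection sketch in Section 2.1 and is an
    assumption, not the paper's exact data structure.
    """
    for text in [forget_text, *paraphrases]:
        unlearn_store.append({
            "text": text,
            "embedding": embed(text),  # precomputed once, reused by the detector
            "policy": policy,          # retrieved later to ground the CoT guard
        })
```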
5.2 Deployment and Applicability
The framework is readily applicable in regulated or privacy-centric domains (healthcare, finance), enabling policy enforcement at inference without the risks or cost of repeated fine-tuning. The modular design is compatible with both open and closed LLM APIs and does not induce latency or utility penalties for non-sensitive prompts.
5.3 Limitations
- Bypass Potential: The approach presumes API-based or otherwise controlled-access deployment; open-weight models may be fine-tuned to circumvent in-context guards.
- Scaling Limitations: Smaller LLMs (around 2B parameters) show degraded CoT compliance.
- Detector Coverage: Recall depends on paraphrase diversity in the unlearn store D_u; adversarial out-of-distribution queries can slightly reduce recall (it remains above 95% in reported experiments).
A plausible implication is that while DRAGON’s architecture provides robust, scalable unlearning with minimal operational friction, adversarial real-world scenarios may eventually challenge paraphrase recall and CoT compliance in frontier LLMs.
6. Broader Significance and Future Directions
The DRAGON approach provides an effective, modular, and training-free solution to practical LLM unlearning, combining robust detection of forget-worthy queries with in-context policy reasoning to ensure secure and compliant LLM deployments. Unique to DRAGON is its strong support for continual unlearning—new removal requests can be satisfied instantly, with no retraining or access to retain data required.
This paradigm is particularly suited to real-time policy compliance in dynamic production environments, notably where user data privacy, regulatory auditability, and model generalizability must be balanced without direct intervention in LLM weights.
Looking forward, challenges remain in adversarial robustness, adapting to generative settings, and extending guard coverage as LLMs increase in scale and flexibility. Extending DRAGON’s in-context paradigms to support not just refusals but safe, grounded reformulations or context-specific redactions is a plausible avenue for further refinement.