
DRAGON: Reasoning-Based Unlearning in LLMs

Updated 15 November 2025
  • DRAGON is a systematic, reasoning-based framework designed for training-free unlearning in large language models.
  • It combines a negative detection module with a Chain-of-Thought guard to enforce privacy and policy compliance during inference.
  • The framework demonstrates high detector recall, stable utility preservation, and scalability across diverse unlearning scenarios.

Detect-Reasoning Augmented GeneratiON (DRAGON) refers to a systematic, reasoning-based framework for robust, training-free unlearning in LLMs. Unlearning aims to ensure that a deployed LLM behaves as if it had never seen a specified set of “forget” data (e.g., private records or harmful concepts), ideally without degrading the model’s overall utility and without access to retained training data. In contrast to traditional fine-tuning-based methods, DRAGON leverages in-context intervention at inference time, utilizing the inherent instruction-following capabilities of LLMs to provide scalable policy enforcement in privacy-sensitive and regulated domains (Wang et al., 8 Nov 2025).

1. Motivation and Conceptual Foundations

LLMs trained on vast corpora can inadvertently memorize private user data or encode hazardous knowledge. The unlearning problem asks: given a collection of “forget” prompts or concepts, how can one ensure the LLM does not reproduce or infer information related to them, while leaving all other behaviors unaffected? Prior approaches predominantly rely on training-time interventions (e.g., negative loss fine-tuning, parameter reweighting), which require access to both forget and retain data, entail considerable computation, and frequently introduce utility degradation. Training-free alternatives, such as prompt filtering and embedding defenses, face challenges in robustness and maintainability, particularly for continual or on-demand unlearning scenarios.

The central insight of DRAGON is to augment LLM inference with a detection and reasoning pipeline, avoiding any modification to the LLM’s parameters. Given an “unlearn store” of forget-worthy prompts and concepts (possibly only accessible as paraphrases without retain data), DRAGON detects sensitive queries and inserts an in-context Chain-of-Thought (CoT) reasoning guard, directing the LLM toward policy-compliant refusal or redaction.

2. DRAGON Framework Architecture

The DRAGON framework consists of two cascaded modules: a negative detection module and a Chain-of-Thought guard.

2.1 Negative Detection Module

The negative detector receives each inference-time prompt $\mathbf{x}$ and computes a confidence score $f(\mathbf{x}, D_u)$, where $D_u$ denotes the current unlearn store:

  • Sample-unlearning (e.g., privacy):

$$f(\mathbf{x}, D_u) = \mathrm{EM}(\mathbf{x}) + \max_{\mathbf{e}_u \in D_u} \left[ \cos(\mathbf{e}, \mathbf{e}_u) \right]$$

Here, $\mathbf{e}$ and $\mathbf{e}_u$ are the prompt and forget-example embeddings, and $\mathrm{EM}(\mathbf{x})$ is an exact-match flag.

  • Concept-unlearning (e.g., hazardous concepts):

$$f(\mathbf{x}, D_u) = \mathbb{I}\big(p_F(\mathbf{x}) > \tau_1\big) + \max_{\mathbf{x}_u \in D_u} \mathrm{BERTScore}(\mathbf{x}, \mathbf{x}_u) + \mathrm{ROUGE\text{-}L}(D_u, \mathbf{x})$$

Here, $p_F(\mathbf{x})$ is the output of a fine-tuned scorer (Llama-3.1-7B); BERTScore and ROUGE-L validate semantic similarity to the forget prompts.

A prompt whose score exceeds the threshold $\tau$ is routed for guarded intervention. A minimal sketch of the sample-unlearning score is given below.
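To make the detection step concrete, the following is a minimal sketch of the sample-unlearning score, assuming the prompt embedding and the forget-example embeddings have already been produced by some sentence encoder; the `unlearn_store` layout (a list of records with `text` and `emb` fields), the function name, and the usage line are illustrative assumptions rather than details specified in the paper. The concept-unlearning score would be computed analogously, using off-the-shelf BERTScore/ROUGE-L implementations and the fine-tuned scorer.

```python
import numpy as np

def sample_unlearn_score(prompt_text, prompt_emb, unlearn_store):
    """Sketch of f(x, D_u) for sample unlearning: an exact-match flag
    plus the maximum cosine similarity to any forget-example embedding."""
    # EM(x): 1.0 if the prompt appears verbatim in the unlearn store.
    em = float(any(prompt_text == item["text"] for item in unlearn_store))

    # max over e_u in D_u of cos(e, e_u), using precomputed embeddings.
    e = np.asarray(prompt_emb, dtype=float)
    e = e / np.linalg.norm(e)
    sims = [
        float(e @ (np.asarray(item["emb"], dtype=float)
                   / np.linalg.norm(item["emb"])))
        for item in unlearn_store
    ]
    return em + (max(sims) if sims else 0.0)

# Routing decision: a prompt is guarded when its score exceeds τ, e.g.
# is_sensitive = sample_unlearn_score(x, embed(x), D_u) > τ
```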

2.2 Chain-of-Thought Guard

A lightweight CoT guard model (Llama-3.1-8B-Instruct), trained on paired (question, CoT instruction) examples, generates contextually grounded reasoning instructions. The original prompt is recast into a guarded prompt:

**Context:** <policy or synthetic record>
**Question:** {x}
**Final instruction:** Let's think step by step. <CoT instruction generated by guard model>

The CoT-specific instruction prompts step-by-step reasoning about whether to refuse or redact per the relevant policy (e.g., GDPR/CCPA compliance), guiding the downstream LLM to maintain compliance.
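For illustration, below is a minimal sketch of the prompt-recasting step, which fills the three-field template shown above; the function name and signature are assumptions for exposition, not the paper's API.

```python
def compose_guarded_prompt(context: str, question: str, cot_instruction: str) -> str:
    """Assemble the guarded prompt: a policy or synthetic-record context,
    the original question, and the CoT instruction emitted by the guard model."""
    return (
        f"**Context:** {context}\n"
        f"**Question:** {question}\n"
        f"**Final instruction:** Let's think step by step. {cot_instruction}"
    )
```

The base LLM then generates from this guarded prompt in place of the raw query, as formalized in the routing procedure of Section 2.3.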

2.3 Routing and Integration

The procedure is formalized as follows:

```python
def DRAGON_Inference(x, D_u, τ):
    # Score the incoming prompt against the unlearn store (Section 2.1).
    score = Detector(x, D_u)
    if score > τ:
        # Sensitive prompt: retrieve the relevant policy or synthetic record,
        # have the guard model produce a CoT instruction (Section 2.2),
        # and recast the query into the guarded template.
        policy = RetrievePolicy(D_u, x)
        cot_inst = GuardModel.generate_cot(x, policy)
        guarded_prompt = ComposePrompt(policy, x, cot_inst)
        return BaseLLM.generate(guarded_prompt)
    else:
        # Non-sensitive prompt: pass through to the base LLM unchanged.
        return BaseLLM.generate(x)
```

The modular design enables integration into existing LLM inference pipelines, functioning as a pre- and post-processing layer.

3. Unlearning Evaluation Metrics

To rigorously assess unlearning quality and stability (including continual unlearning), DRAGON introduces the following metrics:

  • Refusal Quality (RQ): Aggregates (1) template similarity (cosine to refusal templates), (2) refusal rate (binary classification), and (3) normalized gibberish score, with higher RQ reflecting higher-quality, policy-compliant refusals.
  • Dynamic Deviation Score (DDS):

$$\mathrm{DDS} = \frac{1}{T}\sum_{i=1}^{T} s_i + \frac{\beta}{T-1} \sum_{i=1}^{T-1} \max(0,\, s_{i+1} - s_i), \qquad \beta = 0.5$$

Measures cumulative and incremental drift across $T$ sequential unlearning steps, where $s_i$ is the deviation score after step $i$; lower is preferable.

  • Dynamic Utility Score (DUS):

$$\mathrm{DUS} = 1 - \frac{1}{T-1} \sum_{i=1}^{T-1} \left| u_{i+1} - u_i \right|$$

Assesses the stability of retained non-sensitive utility, where $u_i$ is the model utility after step $i$; a larger DUS reflects stronger utility preservation.
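For concreteness, both continual-unlearning metrics translate directly into code. The sketch below assumes `s` and `u` are lists of per-step deviation and utility scores collected over $T \geq 2$ sequential unlearning requests; the function names are illustrative.

```python
def dynamic_deviation_score(s, beta=0.5):
    """DDS: mean deviation over T unlearning steps plus a penalty (weight beta)
    on any increase in deviation between consecutive steps. Lower is better."""
    T = len(s)
    mean_deviation = sum(s) / T
    incremental_drift = sum(max(0.0, s[i + 1] - s[i]) for i in range(T - 1)) / (T - 1)
    return mean_deviation + beta * incremental_drift

def dynamic_utility_score(u):
    """DUS: 1 minus the mean absolute change in utility between consecutive
    steps; a value of 1.0 indicates perfectly stable retained utility."""
    T = len(u)
    return 1.0 - sum(abs(u[i + 1] - u[i]) for i in range(T - 1)) / (T - 1)
```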

4. Experimental Evaluation

DRAGON is evaluated across diverse unlearning scenarios:

| Task | Description | Metrics |
|------|-------------|---------|
| WMDP | Hazardous knowledge unlearning | ProbAcc, RQ |
| TOFU | Privacy record unlearning (fictitious data) | Deviation Score, Model Utility |
| MUSE | Copyrighted content unlearning | VerbMem, KnowMem |

Benchmark LLMs include Zephyr-7B, Llama-3.1-8B-Instruct, Mixtral-8×7B, GPT-4o, and Llama-4. Baselines range from fine-tuning-based approaches (GA, KL, GD, PO, DPO, NPO-RT, FLAT, RMU) to prompt-based methods (Filter-Prompting, ICUL+).

Quantitative results demonstrate:

  • WMDP: ProbAcc reduced to near chance ($\approx 25\%$), with RQ up to 1.296 (Mixtral-47B); base task accuracy (MMLU) is unaffected.
  • TOFU: Lowest Deviation Score (21.4, Llama2-7B-Chat at 1%) versus the best baseline (37.9); retained Model Utility (0.6337) matches that of the original model; KFR = 0.98, KRR = 0.88.
  • MUSE: On both news and books, DRAGON is the only method to meet all MUSE requirements and outperforms others on all verbatim and knowledge memory metrics.
  • Scaling: Larger and “Instruct” LLMs yield superior results. The pipeline scales from 1.5B to 70B-parameter LLMs.

Ablation and qualitative analysis reveal: (1) injected CoT guard reliably elicits policy-compliant refusals, (2) detector recall above 95% for diverse paraphrase sets, (3) continual unlearning remains stable with DDS = 0.2494 and DUS = 1.0.

5. Scalability, Deployment, and Limitations

5.1 Scalability

DRAGON’s detector imposes minimal computational overhead (5–10 ms per query). The guard model is trained once (30–50 minutes on 2×A100) and is usable across base LLM variants, including black-box APIs and instruction-tuned models, without any need for parameter updates or retraining per unlearn request. New forget prompts or concepts can be incorporated rapidly by updating $D_u$ with paraphrases and policies.

5.2 Deployment and Applicability

The framework is readily applicable in regulated or privacy-centric domains (healthcare, finance), enabling policy enforcement at inference without the risks or cost of repeated fine-tuning. The modular design is compatible with both open and closed LLM APIs and does not induce latency or utility penalties for non-sensitive prompts.

5.3 Limitations

  • Bypass Potential: API or controlled-access LLMs are required; open-weight models may be fine-tuned to circumvent in-context guards.
  • Scaling Limitations: Smaller LLMs ($\leq$2B parameters) show degraded CoT compliance.
  • Detector Coverage: Recall depends on paraphrase diversity in $D_u$; adversarial out-of-distribution queries can slightly reduce recall (it stays above 95% in the reported experiments).

A plausible implication is that while DRAGON’s architecture provides robust, scalable unlearning with minimal operational friction, adversarial real-world scenarios may eventually challenge paraphrase recall and CoT compliance in frontier LLMs.

6. Broader Significance and Future Directions

The DRAGON approach provides an effective, modular, and training-free solution to practical LLM unlearning, combining robust detection of forget-worthy queries with in-context policy reasoning to ensure secure and compliant LLM deployments. Unique to DRAGON is its strong support for continual unlearning—new removal requests can be satisfied instantly, with no retraining or access to retain data required.

This paradigm is particularly suited to real-time policy compliance in dynamic production environments, notably where user data privacy, regulatory auditability, and model generalizability must be balanced without direct intervention in LLM weights.

Looking forward, challenges remain in adversarial robustness, adapting to generative settings, and extending guard coverage as LLMs increase in scale and flexibility. Extending DRAGON’s in-context paradigms to support not just refusals but safe, grounded reformulations or context-specific redactions is a plausible avenue for further refinement.
