Papers

Topics

Authors

Recent

View all

Gemini 2.5 Flash

Gemini 2.5 Flash 92 tok/s

Gemini 2.5 Pro 47 tok/s Pro

GPT-5 Medium 19 tok/s

GPT-5 High 18 tok/s Pro

GPT-4o 96 tok/s

GPT OSS 120B 473 tok/s Pro

Kimi K2 26 tok/s Pro

2000 character limit reached

CRISP: Persistent Concept Unlearning via Sparse Autoencoders (2508.13650v1)

Published 19 Aug 2025 in cs.CL

Abstract: As LLMs are increasingly deployed in real-world applications, the need to selectively remove unwanted knowledge while preserving model utility has become paramount. Recent work has explored sparse autoencoders (SAEs) to perform precise interventions on monosemantic features. However, most SAE-based methods operate at inference time, which does not create persistent changes in the model's parameters. Such interventions can be bypassed or reversed by malicious actors with parameter access. We introduce CRISP, a parameter-efficient method for persistent concept unlearning using SAEs. CRISP automatically identifies salient SAE features across multiple layers and suppresses their activations. We experiment with two LLMs and show that our method outperforms prior approaches on safety-critical unlearning tasks from the WMDP benchmark, successfully removing harmful knowledge while preserving general and in-domain capabilities. Feature-level analysis reveals that CRISP achieves semantically coherent separation between target and benign concepts, allowing precise suppression of the target features.

Collections

Summary

The paper introduces CRISP, a method that persistently unlearns unwanted knowledge from language models using sparse autoencoders.
It utilizes a two-phase process with contrastive activation analysis for feature selection and LoRA-based fine-tuning to suppress target features while preserving benign ones.
Experimental results on WMDP benchmarks demonstrate that CRISP outperforms existing unlearning methods with significant gains in unlearning and retention metrics.

Persistent Concept Unlearning via Sparse Autoencoders

Introduction

The paper introduces CRISP (Concept Removal via Interpretable Sparse Projections), a parameter-efficient methodology designed to persistently unlearn specific knowledge from LLMs using Sparse Autoencoders (SAEs). As LLMs continue to be implemented in real-world applications, the demand to erase certain knowledge—due to safety, privacy, or copyright concerns—grows. Unlike previous methods that often operate at inference time, CRISP ensures changes in the model's parameters, making them less susceptible to reversal by adversarial entities.

Methodology

CRISP operates in a two-phase process:

Feature Selection:
- Contrastive Activation Analysis: CRISP leverages pre-trained SAEs to identify features frequently activated by the target (unwanted) corpus but not the benign (retained) corpus (Figure 1). For every feature, two primary metrics are computed:
  - Activation Count Difference: Quantifies the difference in activation frequency between target and benign datasets.
  - Relative Activation Ratio: Measures the feature's relative prominence in the target dataset.

Salient Features Extraction: By applying thresholds on activation metrics, CRISP selects salient features for suppression.

Model Optimization:
- Parameter-Efficient Fine-Tuning with LoRA: Incorporating a multi-faceted loss function that penalizes target feature activation while encouraging retention of benign features. It combines unlearning, retention, and coherence losses weighted to balance the three objectives.
- Layerwise Updates: Features identified in specific layers where activations are mapped and suppressed efficiently.
  Figure 1: Overview of CRISP: (1) We identify features that are frequently and strongly activated by the target corpus---but not by the benign corpus---using pre-trained sparse autoencoders (SAEs). (2) We then fine-tune the model to suppress these features on the target corpus, while preserving their activations on the benign corpus.

Experimental Setup

Datasets and Models

CRISP is evaluated on datasets from the WMDP benchmark, including domains such as biosecurity and cybersecurity, across two LLMs: Llama-3.1-8B and Gemma-2-2B. These datasets contain target and retain corpora tailored to stress-test the unlearning capabilities while preserving unrelated knowledge.

Baseline Comparisons

CRISP is compared against contemporary unlearning techniques: RMU and ELM. Unlike these methods, CRISP leverages SAE-based feature suppression, allowing for more selective, less disruptive parameter updates.

Evaluation Metrics

Performance is evaluated based on:

Unlearn Accuracy: Lower unplanned concept retention indicates stronger unlearning.
Retain Accuracy and MMLU: Reflects the model's capacity to preserve general and specific benign knowledge.
Fluency and Concept Scores: Assesses the generation quality and ability to maintain relevant conceptual knowledge.

Results

CRISP demonstrates superior performance, achieving a balance of unlearning efficacy and retention across metrics, notably outperforming baselines with improvements ranging from 5 to 34 points on the WMDP benchmark. Qualitatively, CRISP maintains fluency and coherence better than alternatives which often degrade model generation quality or lead to off-topic responses.

Figure 2: Qualitative comparison of generations after different unlearning methods. We prompt about non-harmful biomedical knowledge that is topically related to harmful concepts from the WMDP-Bio dataset. While existing methods disrupt fluency or inject artifacts (e.g., repetition, formatting tokens), CRISP retains coherent and informative generations, demonstrating effective preservation of general-domain capabilities.

Feature-Level Analysis

CRISP's interpretability is evidenced through feature analysis, emphasizing its precision in targeting unwanted knowledge while preserving others:

Target Features: Predominantly characterize harmful biomedical terms, e.g., references linked to viral infections and contagion mechanisms.
Benign Features: Encompass keywords indicative of general research and clinical terminology.
Shared Features: Highlight non-specific linguistic tokens, demonstrating CRISP's capability to avoid generalized suppression.

Figure 3: Trade-off between Retain Accuracy (y-axis) and Unlearn Accuracy (x-axis) on the WMDP-Cyber benchmark. Top: Llama-3.1-8B, Bottom: Gemma-2-2B. Each point shows one of 200 hyperparameter settings per method. The red star indicates the ideal outcome—complete forgetting with no loss in retain accuracy. The solid line traces the best result per unlearning bucket, forming the Pareto frontier.

Conclusion

The introduction of CRISP sets a new standard in persistent concept unlearning in LLMs by combining interpretable SAE feature extractions with efficient parameter tuning. It achieves an optimal decommissioning of undesired knowledge while retaining fluency in general model outputs. Further research could expand its application scope, adapt it to diverse models, and evaluate its resilience against adversarial attacks in variable contexts.

PDF Markdown

Paper Prompts

Explore 10 Community Prompts

Follow-up Questions

Authors (5)

Tweets

https://twitter.com/tomerashuach/status/1958437914599104606

https://twitter.com/HuggingPapers/status/1960134856077164911

https://twitter.com/arxivsanitybot/status/1958525382845513894

YouTube

Show All Videos

alphaXiv

CRISP: Persistent Concept Unlearning via Sparse Autoencoders (25 likes, 0 questions)