EasyEdit2: An Easy-to-use Steering Framework for Editing Large Language Models (2504.15133v1)

Published 21 Apr 2025 in cs.CL, cs.AI, cs.CV, cs.HC, and cs.LG

Abstract: In this paper, we introduce EasyEdit2, a framework designed to enable plug-and-play adjustability for controlling LLM behaviors. EasyEdit2 supports a wide range of test-time interventions, including safety, sentiment, personality, reasoning patterns, factuality, and language features. Unlike its predecessor, EasyEdit2 features a new architecture specifically designed for seamless model steering. It comprises key modules such as the steering vector generator and the steering vector applier, which enable automatic generation and application of steering vectors to influence the model's behavior without modifying its parameters. One of the main advantages of EasyEdit2 is its ease of use: users do not need extensive technical knowledge. With just a single example, they can effectively guide and adjust the model's responses, making precise control both accessible and efficient. Empirically, we report model steering performance across different LLMs, demonstrating the effectiveness of these techniques. We have released the source code on GitHub at https://github.com/zjunlp/EasyEdit along with a demonstration notebook. In addition, we provide a demo video at https://zjunlp.github.io/project/EasyEdit2/video for a quick introduction.

Summary

The paper introduces EasyEdit2, a framework that guides LLM behavior at inference by applying steering vectors without changing model parameters.
The framework combines a steering vector generator and applier with a library of pre-trained vectors to control safety, sentiment, personality, and reasoning.
Experiments on Gemma-2-9B and Qwen2.5-7B show that steering methods such as CAA and STA generally outperform the unsteered baseline on safety (Defense Rate) and sentiment (Positive Rate).

EasyEdit2 (2504.15133) is presented as an easy-to-use framework designed for test-time steering of LLMs. The core idea is to modify LLM behavior during inference without altering the model's underlying parameters, providing fine-grained control over various aspects of the output. This approach contrasts with traditional model editing techniques that permanently change model weights.

The framework aims to address the challenge of controlling LLM behavior in real-world applications, where issues like generating unreliable or unsafe outputs, inconsistent style, or undesirable reasoning patterns may arise. By enabling precise intervention during the forward pass, EasyEdit2 allows users to adjust model responses based on specific needs or observed behaviors. The framework is designed for ease of use, requiring minimal technical expertise, often needing only a single example to guide the model's output.

Key Features and Architecture

EasyEdit2 features a new architecture built around two primary modules:

Steering Vector Generator Module: This module creates "steering vectors," which represent the desired behavioral shift. It supports several generation methods, such as Contrastive Activation Addition (CAA) [CAA] and methods that use Sparse Autoencoder (SAE) features. It iterates over datasets according to the configured hyperparameters to compute these vectors.
Steering Vector Applier Module: This module integrates the generated steering vectors into the target LLM during inference. It supports multiple steering methods concurrently, including prompt-based and activation-based approaches, with decoding-based approaches reserved for future integration. A model wrapper simplifies the application process, allowing multiple vectors and prompts to be combined.

Beyond these core modules, the framework includes:

Steering Vector Library: A repository of pre-trained steering vectors for common scenarios, offering a plug-and-play option for users. It also supports vector merging techniques like Linear [Linear], TIES [TIES], and DARE [DARE] TIES to combine multiple vectors for complex control.
Datasets Module: Standardizes data loading and preprocessing from various formats for use in both vector generation and evaluation.
Hparams Module: A two-tiered system for managing hyperparameters, ensuring consistent and reproducible configuration across different methods and experiments.
Evaluators Module: Provides tools to assess the effectiveness of steering across different dimensions. It supports rule-based, classifier-based, and LLM-based evaluation methods, allowing for adaptive and user-defined scenario assessments, inspired by AXBENCH (2501.17148).
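The vector merging mentioned for the Steering Vector Library can be illustrated with a small NumPy sketch. This is not EasyEdit2's implementation: the function names `linear_merge` and `ties_merge` and the `density` parameter are assumptions made for illustration, and the TIES variant shown is a simplified reading of the trim, sign-election, and agreement-averaging steps. DARE (random dropping with rescaling) is omitted.

```python
import numpy as np

def linear_merge(vectors, weights):
    """Weighted sum of steering vectors (all must share one shape)."""
    stacked = np.stack(vectors).astype(float)      # (n_vectors, hidden_dim)
    w = np.asarray(weights, dtype=float).reshape(-1, 1)
    return (w * stacked).sum(axis=0)

def ties_merge(vectors, density=0.5):
    """Simplified TIES-style merge: trim small entries, elect one sign per
    dimension, then average only the entries that agree with that sign."""
    stacked = np.stack(vectors).astype(float)
    k = max(1, int(round(density * stacked.shape[1])))
    trimmed = np.zeros_like(stacked)
    for i, v in enumerate(stacked):
        keep = np.argsort(np.abs(v))[-k:]          # k largest magnitudes
        trimmed[i, keep] = v[keep]
    elected = np.sign(trimmed.sum(axis=0))         # sign with more total mass
    agree = (np.sign(trimmed) == elected) & (trimmed != 0)
    counts = np.maximum(agree.sum(axis=0), 1)      # avoid division by zero
    return (trimmed * agree).sum(axis=0) / counts

# Toy vectors standing in for two pre-trained steering vectors.
a = np.array([1.0, -2.0, 0.1, 0.0])
b = np.array([3.0, 1.0, -0.2, 0.0])
merged_linear = linear_merge([a, b], [0.5, 0.5])
merged_ties = ties_merge([a, b], density=0.5)
```

Linear merging keeps every coordinate, so conflicting directions partly cancel. The TIES-style variant drops small entries and resolves sign conflicts before averaging, which is the motivation usually given for using it when combining several vectors.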

Supported Intervention Scenarios

EasyEdit2 supports steering LLMs across a wide range of behaviors:

Safety: Modifying responses to resist jailbreak attacks, reduce social biases, reject harmful queries, enforce regulatory compliance, and mitigate privacy risks.
Sentiment: Controlling the emotional tone of responses, from negative to positive.
Personality: Shaping the model's persona and underlying values, enabling effective role-playing.
Reasoning Pattern: Influencing how the model processes information, such as controlling reasoning length, balancing knowledge sources, or enforcing specific reasoning structures.
Factuality: Steering outputs towards or away from specific factual claims, mitigating hallucinations, enabling knowledge forgetting, and promoting self-verification.
Language Feature: Controlling response language, formatting, syntax, and style.

Steering Methods

The framework categorizes supported methods into three main types:

Prompt-based Steering: Uses prompt engineering (manual or auto-generated) to guide model output.
Activation-based Interventions: Generates and applies steering vectors to model activations during the forward pass. Examples include:
- Contrastive Activation Addition (CAA): Computes the difference in activations between desired and undesired example pairs to create a steering vector.
- LM-Steer: Applies a learned linear transformation to output embeddings.
- SAE Feature Steering: Uses interpretable features extracted from Sparse Autoencoders as steering vectors.
- Steering Target Atoms (STA): Extends CAA using SAEs to refine vectors.
Decoding-based Control: Modifies the model's generation logic during decoding. An interface is reserved for future integration of such methods.
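As a rough illustration of the activation-based route, the sketch below computes a CAA-style vector as the mean activation difference between desired and undesired examples, then adds a scaled copy to a layer's hidden states. It uses plain NumPy arrays in place of real model activations; `caa_vector`, `apply_steering`, and `multiplier` are hypothetical names, not the framework's API, and choices such as which layer and token position to read from are left out.

```python
import numpy as np

def caa_vector(pos_acts, neg_acts):
    """Mean activation of desired examples minus mean of undesired ones."""
    pos = np.asarray(pos_acts, dtype=float)        # (n_pos, hidden_dim)
    neg = np.asarray(neg_acts, dtype=float)        # (n_neg, hidden_dim)
    return pos.mean(axis=0) - neg.mean(axis=0)

def apply_steering(hidden, vector, multiplier=1.0):
    """Add the scaled vector to every token position of one layer's output."""
    return np.asarray(hidden, dtype=float) + multiplier * vector

# Toy activations for two desired and two undesired examples.
pos = [[1.0, 2.0], [3.0, 4.0]]
neg = [[0.0, 1.0], [1.0, 1.0]]
v = caa_vector(pos, neg)

# Three token positions with a 2-dimensional hidden state.
hidden = np.zeros((3, 2))
steered = apply_steering(hidden, v, multiplier=2.0)
```

The model's weights are never touched: only the intermediate activations change, and setting `multiplier` to zero recovers the original behavior. A negative multiplier would push the output away from the desired direction.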

Implementation and Usage

EasyEdit2 is implemented to be easily accessible. Users can configure the entire process using a unified configuration file. The framework's design allows for flexible execution of various methods on different datasets. The paper highlights a minimal code snippet demonstrating how to load configuration, generate vectors, apply them, and get steered responses, emphasizing the low-code nature.

from easyedit import (
    set_hparams,
    SteeringVectorGenerator,
    SteeringVectorApplier,
    DatasetLoader,
    Evaluators,
)

config = set_hparams("path/to/config.yaml")

data_loader = DatasetLoader(config.datasets)
generation_data = data_loader.get_dataset("generation_set")
evaluation_data = data_loader.get_dataset("evaluation_set")

generator = SteeringVectorGenerator(config.generator)
steering_vectors = generator.generate(generation_data)

applier = SteeringVectorApplier(config.applier)
steered_model = applier.apply(model, steering_vectors) # Assumes 'model' is loaded elsewhere

steered_outputs = steered_model.generate(evaluation_data.prompts)

evaluator = Evaluators(config.evaluator)
results = evaluator.evaluate(steered_outputs, evaluation_data.ground_truth)

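# Second snippet: steer a single prompt using language-feature hyperparameters
# ('model' is again assumed to be loaded elsewhere).
hparams = set_hparams("path/to/language_feature_hparams.yaml")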
generator = SteeringVectorGenerator(hparams.generator)
applier = SteeringVectorApplier(hparams.applier)
steering_vector = generator.generate(hparams.generator.gen_data)
result = applier.apply(model, steering_vector, prompt="Which club is Messi at?")
print(result)

Experiments and Results

The paper presents experimental results on safety and sentiment steering using Gemma-2-9B [Riviere2024Gemma2I] and Qwen2.5-7B [qwen2]. Evaluation metrics included Defense Rate (for safety), Positive Rate (for sentiment), and Fluency. The results show that the tested steering methods (CAA, STA, LM-Steer, Prompt$_{\text{Auto}}$) generally outperform the baseline model without steering. CAA and STA were particularly effective for safety and sentiment, while LM-Steer and Prompt$_{\text{Auto}}$ showed improvements but had some limitations depending on hyperparameters and prompt quality.
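For readers unfamiliar with rate-style metrics, a minimal sketch follows. It assumes Defense Rate is the percentage of responses judged safe and Positive Rate is the percentage judged positive in sentiment. The summary above does not give the paper's exact definitions, so treat both as assumptions. The judgment lists are made-up placeholders, not results from the paper.

```python
def rate(flags):
    """Percentage of True entries in a list of per-response judgments."""
    flags = [bool(f) for f in flags]
    if not flags:
        raise ValueError("need at least one judged response")
    return 100.0 * sum(flags) / len(flags)

# Hypothetical per-response judgments, e.g. from a safety classifier
# and a sentiment classifier (illustrative values only).
is_safe = [True, True, False, True]
is_positive = [True, False, False, True]

defense_rate = rate(is_safe)
positive_rate = rate(is_positive)
```

Comparing these rates between a steered and an unsteered run of the same model on the same prompts gives the kind of baseline comparison the section describes.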

Practical Aspects and Ethical Considerations

The authors provide an online demo built with Gradio, allowing users to interact with the steered model in real time and explore different steering effects. The source code is released under the MIT License, which permits use and modification. Case studies demonstrate the framework's ability to induce significant behavioral shifts across the six scenarios, including shifting a model from safe to unsafe behavior. The authors explicitly discuss the significant ethical risks associated with steering techniques, particularly the potential for misuse to generate harmful or unethical content. They emphasize the need for rigorous safety inspections and ethical safeguards when using EasyEdit2. The framework is intended to benefit the community by providing a tool for precise LLM control and supporting interpretable analysis via SAE features.
