
AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders (2501.17148v3)

Published 28 Jan 2025 in cs.CL, cs.AI, and cs.LG

Abstract: Fine-grained steering of LLM outputs is essential for safety and reliability. Prompting and finetuning are widely used to achieve these goals, but interpretability researchers have proposed a variety of representation-based techniques as well, including sparse autoencoders (SAEs), linear artificial tomography, supervised steering vectors, linear probes, and representation finetuning. At present, there is no benchmark for making direct comparisons between these proposals. Therefore, we introduce AxBench, a large-scale benchmark for steering and concept detection, and report experiments on Gemma-2-2B and 9B. For steering, we find that prompting outperforms all existing methods, followed by finetuning. For concept detection, representation-based methods such as difference-in-means, perform the best. On both evaluations, SAEs are not competitive. We introduce a novel weakly-supervised representational method (Rank-1 Representation Finetuning; ReFT-r1), which is competitive on both tasks while providing the interpretability advantages that prompting lacks. Along with AxBench, we train and publicly release SAE-scale feature dictionaries for ReFT-r1 and DiffMean.

Summary

  • The paper introduces AxBench as a standardized evaluation framework for steering LLM outputs and detecting internal concepts.
  • It systematically compares diverse techniques—prompting, finetuning, and representation-based methods like DiffMean and ReFT-r1—to assess control and interpretability.
  • The findings challenge the effectiveness of sparse autoencoders and promote parameter-efficient methods for safer, more reliable LLM behavior.

The paper "AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders" (2501.17148) introduces a benchmark, AxBench, designed to systematically evaluate and compare various techniques for steering the outputs of LLMs and detecting internal concepts. The core motivation stems from the need for reliable and safe LLM behavior, which necessitates fine-grained control over model generation. While prompting and finetuning are prevalent methods, the interpretability field has proposed several representation-based alternatives, including sparse autoencoders (SAEs), linear artificial tomography (LAT), supervised steering vectors, linear probes, and representation finetuning (ReFT). However, the lack of a standardized evaluation framework hindered direct comparisons. AxBench aims to fill this gap.

AxBench: A Benchmark for Steering and Concept Detection

AxBench provides a large-scale evaluation platform focused on two primary tasks: steering LLM outputs towards or away from specific concepts and detecting the presence of those concepts within the model's internal representations. The benchmark facilitates comparisons between diverse methods operating at different levels: input modification (prompting), parameter modification (finetuning, ReFT), and representation intervention (SAEs, steering vectors, LAT).
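
To make the representation-intervention category concrete, below is a minimal sketch of inference-time steering with a concept vector (an illustration, not the paper's implementation; model, layer_idx, steering_vector, and alpha are assumed names):

import torch

def make_steering_hook(steering_vector, alpha):
    # Returns a forward hook that shifts a layer's output along a concept direction.
    def hook(module, inputs, output):
        # Many transformer modules return a tuple whose first element is the
        # hidden state of shape (batch, seq_len, hidden_dim)
        hidden = output[0] if isinstance(output, tuple) else output
        # Broadcast the (hidden_dim,) concept vector over batch and positions
        steered = hidden + alpha * steering_vector
        return (steered, *output[1:]) if isinstance(output, tuple) else steered
    return hook

# Hypothetical usage on a loaded transformer:
# handle = model.model.layers[layer_idx].register_forward_hook(
#     make_steering_hook(steering_vector, alpha=8.0))
# ... run generation, then handle.remove() to restore the unmodified model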

The experiments reported utilize Gemma-2 models, specifically the 2B and 9B parameter variants. The benchmark encompasses a range of concepts relevant to safety and desired model behavior. By establishing common tasks and metrics, AxBench allows for a quantitative assessment of the efficacy of different approaches intended to influence or understand LLM behavior at a granular level. The benchmark itself, along with trained feature dictionaries for some methods, has been publicly released, encouraging further research and standardized comparisons.

Comparative Evaluation of Steering Techniques

A key finding from the AxBench evaluations concerns the relative performance of different steering methodologies. The results indicate a clear hierarchy in terms of steering effectiveness:

  1. Prompting: Consistently demonstrated the highest performance in guiding model outputs according to the benchmark tasks. This suggests that modifying the input context remains a highly effective, albeit potentially less interpretable, method for controlling generation.
  2. Finetuning: Full-parameter finetuning emerged as the second most effective technique. While computationally more intensive than prompting or representation-based methods, adapting the model's weights directly provides substantial control over its behavior.
  3. Representation-Based Methods: The performance of various representation-based techniques was mixed. Notably, Sparse Autoencoders (SAEs), despite their popularity in interpretability research for purportedly identifying meaningful features, were found to be not competitive on the steering tasks within AxBench. Other methods like LAT, supervised steering vectors, and linear probes also generally underperformed compared to prompting and finetuning.

This outcome challenges the assumption that interventions based on decomposed representations, such as those derived from SAEs, readily translate into effective steering mechanisms. The benchmark suggests that simpler, more direct methods currently offer superior control.

Performance in Concept Detection

AxBench also evaluates the ability of different methods to detect concepts within LLM representations. This task assesses how well a method can identify if a given concept is active or represented internally for a specific input.

The findings for concept detection diverge somewhat from the steering results:

  1. Representation-Based Methods (Non-SAE): Techniques like difference-in-means (DiffMean) vectors performed strongly on concept detection. DiffMean computes the difference between the average representations of positive and negative examples for a given concept. Its success suggests that simple linear structures within the activation space effectively capture concept presence (see the sketch below).
  2. Sparse Autoencoders (SAEs): Similar to the steering task, SAEs were found to be not competitive for concept detection on AxBench. This implies that the features learned by SAEs, at least with current training methodologies and architectures used in the paper, may not align well with the specific concepts targeted by the benchmark or may not provide a sufficiently discriminative signal for detection compared to simpler methods like DiffMean.

The strong performance of DiffMean highlights that even straightforward linear methods applied to LLM representations can be effective for identifying semantic concepts, potentially offering interpretability without the complexities and apparent performance limitations (on these tasks) of SAEs.
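
As an illustration of that two-sided use, a minimal DiffMean sketch follows, assuming pos_acts and neg_acts are matrices of layer activations collected for positive and negative examples (the names and threshold are hypothetical):

import torch

def diffmean_direction(pos_acts, neg_acts):
    # Difference between the mean positive and mean negative activations,
    # each of shape (n_examples, hidden_dim)
    return pos_acts.mean(dim=0) - neg_acts.mean(dim=0)

def detect_concept(acts, direction, threshold=0.0):
    # Project activations onto the DiffMean direction; scores above the
    # threshold are taken to indicate the concept is present
    scores = acts @ direction
    return scores > threshold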

Rank-1 Representation Finetuning (ReFT-r1)

Addressing the limitations observed, particularly the performance gap between prompting/finetuning and existing representation-based methods, the paper introduces a novel technique: Rank-1 Representation Finetuning (ReFT-r1). This method is designed to be weakly supervised and operate directly on model representations.

ReFT-r1 modifies the standard Representation Finetuning (ReFT) approach by constraining the parameter updates to be rank-1. Specifically, it learns low-rank updates to the weight matrices of targeted layers (e.g., MLP or attention layers). For a weight matrix W, the update ΔW is constrained to be of the form uv^T, where u and v are vectors. This significantly reduces the number of trainable parameters compared to full finetuning or standard ReFT.

The weak supervision comes from using contrasting pairs of inputs (positive/negative examples related to the target concept) to guide the learning of the rank-1 updates. The objective typically involves maximizing the difference in the model's behavior (e.g., log-probability of a target token) between positive and negative examples, achieved by optimizing u and v.
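
The training loop below sketches this procedure in PyTorch-style code; forward_pass, compute_steering_loss, and the dataloader of contrastive pairs are placeholders rather than the paper's released implementation.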

import torch

# W is the frozen weight matrix of the targeted layer; forward_pass and
# compute_steering_loss are placeholders for the layer computation and the
# steering objective. num_epochs, alpha (learning rate), and reg are
# hyperparameters ("lambda" is reserved in Python, hence "reg").
d_out, d_in = W.shape
u = torch.randn(d_out, 1, requires_grad=True)
v = torch.randn(d_in, 1, requires_grad=True)
optimizer = torch.optim.SGD([u, v], lr=alpha)

for epoch in range(num_epochs):
    for positive_batch, negative_batch in dataloader:
        # Apply the shared rank-1 update to the frozen weights
        W_updated = W + u @ v.T

        # Positive examples: promote the concept
        output_pos = forward_pass(positive_batch, W_updated)
        loss_pos = compute_steering_loss(output_pos, target="promote")  # e.g., maximize log-prob of desired token

        # Negative examples: suppress the concept
        output_neg = forward_pass(negative_batch, W_updated)
        loss_neg = compute_steering_loss(output_neg, target="suppress")  # e.g., minimize log-prob of desired token

        # Maximize the behavioral gap, with optional L2 regularization on u and v
        loss = (loss_neg - loss_pos) + reg * (u.norm() ** 2 + v.norm() ** 2)

        # Update u and v by gradient descent
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

On AxBench, ReFT-r1 demonstrated competitive performance on both steering and concept detection tasks. It significantly outperformed SAEs and other representation-based methods in steering, approaching the effectiveness of finetuning in some cases, while also performing well in concept detection, similar to DiffMean. Crucially, ReFT-r1 retains some interpretability advantages associated with representation-based methods, as the rank-1 update uv^T can be analyzed (e.g., examining v as a direction in the input activation space and u as a modification in the output space). It offers a potential middle ground, providing better performance than prior representation intervention techniques while being more targeted and potentially more interpretable than full finetuning.
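
As a concrete (hypothetical) example of such an analysis, one might compare the learned input direction v with the DiffMean vector for the same concept:

import torch.nn.functional as F

# v: learned rank-1 input direction; diffmean: DiffMean vector for the same concept
similarity = F.cosine_similarity(v.flatten(), diffmean.flatten(), dim=0)
print(f"cosine(v, DiffMean) = {similarity.item():.3f}")

# Projecting token activations onto v would likewise show which inputs
# most strongly engage the learned intervention.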

Implications and Resources

The findings presented in the AxBench paper have significant practical implications for developing safer and more controllable LLMs. The results strongly suggest that, for direct steering performance, practitioners should currently prioritize prompt engineering and finetuning. While representation-based methods hold promise for interpretability, techniques like SAEs, in their current form evaluated here, appear suboptimal for steering and concept detection compared to simpler baselines like DiffMean or the newly proposed ReFT-r1.

ReFT-r1 emerges as a promising direction, offering a parameter-efficient method (learning only u and v per target layer) that achieves competitive performance while potentially allowing for greater insight into the mechanism of control compared to full finetuning or opaque prompting strategies.
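
To put the efficiency in perspective: for a square projection with hidden dimension d = 2304 (Gemma-2-2B), a rank-1 update trains only 2d = 4,608 parameters per target layer, versus d² ≈ 5.3M for the full matrix, roughly a thousandfold reduction. These counts are illustrative; the exact savings depend on which matrices are targeted.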

The public release of AxBench provides a valuable resource for the community to benchmark future steering and interpretability methods rigorously. Additionally, the released feature dictionaries for ReFT-r1 and DiffMean on Gemma-2 models allow researchers to readily apply and build upon these techniques.

Conclusion

The introduction of AxBench provides a much-needed framework for the empirical evaluation of LLM steering and concept detection methods. The paper's results challenge the practical efficacy of SAEs for these specific tasks, finding that prompting and finetuning remain superior for steering, while methods like DiffMean excel at concept detection. The proposed ReFT-r1 technique presents a competitive, parameter-efficient alternative that bridges performance and potential interpretability, offering a promising avenue for future research in representation-based model control.
