Investigating task-specific prompts and sparse autoencoders for activation monitoring (2504.20271v1)

Published 28 Apr 2025 in cs.LG

Abstract: LLMs can behave in unexpected and unsafe ways, and so it is valuable to monitor their outputs. Internal activations of LLMs encode additional information that could be useful for this. The baseline approach for activation monitoring is some variation of linear probing on a particular layer: starting from a labeled dataset, train a logistic regression classifier on that layer's activations. Recent work has proposed several approaches which may improve on naive linear probing, by leveraging additional computation. One class of techniques, which we call "prompted probing," leverages test time computation to improve monitoring by (1) prompting the model with a description of the monitoring task, and (2) applying a learned linear probe to resulting activations. Another class of techniques uses computation at train time: training sparse autoencoders offline to identify an interpretable basis for the activations, and e.g. max-pooling activations across tokens using that basis before applying a linear probe. However, one can also prompt the model with a description of the monitoring task and use its output directly. We develop and test novel refinements of these methods and compare them against each other. We find asking the model zero-shot is a reasonable baseline when inference-time compute is not limited; however, activation probing methods can substantially outperform this baseline given sufficient training data. Specifically, we recommend prompted probing when inference-time compute is available, due to its superior data efficiency and good generalization performance. Alternatively, if inference-time compute is limited, we find SAE-based probing methods outperform raw activation probing.

Summary

  • The paper demonstrates that prompted last token probing significantly improves data efficiency in low-data settings.
  • SAE max-pooled probing achieves competitive performance with reduced inference compute, though its effectiveness varies by task.
  • Practical guidelines are provided to select between zero-shot, prompted, and raw activation probing based on data and computational resources.

LLMs can sometimes produce unexpected or unsafe outputs. Monitoring these outputs is crucial for safety in real-world applications. While monitoring can be done by prompting another LLM to evaluate the output or by probing the internal activations of the model, both methods have limitations. Prompting requires extra computation during inference but needs little to no labeled data. Probing can potentially be done using activations already computed during the model's forward pass (making it computationally "free" at inference), but typically requires a substantial amount of labeled data to train a classifier on the high-dimensional activation vectors.

This paper, "Investigating task-specific prompts and sparse autoencoders for activation monitoring" (2504.20271), systematically investigates and compares different approaches to activation monitoring, including variations of linear probing, methods leveraging sparse autoencoders (SAEs), and methods combining prompting with probing. The goal is to understand the trade-offs between these techniques under various constraints, particularly concerning labeled data availability and inference-time compute. The authors hypothesize that prompting the model with the monitoring task description can improve probing performance by making the task-relevant information more accessible in the activations, especially in low-data settings.

The paper uses a snapshot of GPT-4o and evaluates performance on four tasks: two moderation tasks (harassment and violence detection), hallucination detection (on SimpleQA responses), and sentiment analysis (on Rotten Tomatoes movie reviews). The datasets are split 80/20 for training and testing, with experiments varying the amount of training data used. Generalization performance is also tested by training on one data distribution (e.g., English) and testing on another (e.g., non-English).

Methods:

  1. Passage Processing: Input passages are formatted based on the monitoring approach. For standard probing, raw text is converted to a simple chat format. For prompted probing, both raw text and existing chat conversations are converted to a plaintext "User: ... Assistant: ..." format and then embedded within a task-specific prompt template. Two prompted templates are explored:
    • Prefix+Suffix: Includes a task description before the passage (e.g., "Please evaluate...violence.") and a question after the passage (e.g., "Does the passage contain violence?").
    • Suffix-only: Includes the task description only after the passage. This is more compute-efficient when monitoring for multiple categories in parallel, as the initial forward pass over the passage can be shared. (A prompt-formatting sketch follows this list.)
  2. Activation Extraction: Residual stream activations are saved from several evenly spaced layers throughout the model. Probes are trained on either the activations of the last token or a max-pooled aggregation of activations across all tokens in the passage.
  3. Sparse Autoencoders (SAEs): An SAE is pre-trained and fine-tuned on model activations. For probing, the TopK activation function used during training is replaced with a JumpReLU function (Rajamanoharan et al., 19 Jul 2024) at inference to allow independent computation of latent activations. Probes are trained on either SAE latent activations (post-nonlinearity) or SAE pre-activations (pre-nonlinearity). Max-pooling is applied to SAE activations/pre-activations across tokens before probing. (A feature-extraction sketch follows this list.)
  4. Linear Classifier Training: Logistic Regression classifiers (using scikit-learn) are trained on the extracted activations.
    • Raw activation probes use the lbfgs solver and L2 regularization.
    • SAE latent/pre-activation probes use the saga solver and L1 regularization.
    • For SAE-based probes, a feature selection method based on the mean-difference metric (Gurnee et al., 2023) is used to select the top Q most relevant latents/pre-activations. Hyperparameters such as the regularization parameter C and the number of features Q are tuned. (A probe-training sketch follows this list.)
  5. Baselines:
    • Zero-Shot Prompting: The model is directly prompted with the task using the prefix+suffix template, and the difference between the Yes and No output logits is used for classification (illustrated in the prompt-formatting sketch after this list).
    • Few-Shot Prompting: A few labeled examples are included in the prompt along with the test example.
    • LAT Scan (Zou et al., 2023): A simplified, unsupervised version of Linear Artificial Tomography is implemented by computing the first principal component of the difference matrix of last-token activations for training examples and projecting test examples onto this component. (A LAT-scan sketch follows this list.)
  6. Combining Methods: A second-level classifier is trained on the combination of the linear probe's output logit and the difference between the model's Yes/No output logits.
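
The following is a minimal sketch of how the two prompt templates and the zero-shot Yes/No logit-difference baseline (methods 1 and 5 above) could look in code. The template wording, the stand-in model (`gpt2` via Hugging Face `transformers`), and the token handling are illustrative assumptions, not the paper's exact prompts or infrastructure.

```python
# Minimal sketch (not the paper's exact prompts or infrastructure) of the two
# prompt templates and the zero-shot Yes/No logit-difference baseline.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def format_passage(turns):
    """Flatten a conversation into the plaintext 'User: ... Assistant: ...' form."""
    return "\n".join(f"{role}: {text}" for role, text in turns)

def prefix_suffix_prompt(passage, category="violence"):
    prefix = f"Please evaluate the following passage for {category}.\n\n"
    suffix = f"\n\nDoes the passage contain {category}? Answer Yes or No."
    return prefix + passage + suffix

def suffix_only_prompt(passage, category="violence"):
    # No task description before the passage, so the forward pass over the
    # passage tokens can be shared when monitoring several categories in parallel.
    return passage + f"\n\nDoes the passage contain {category}? Answer Yes or No."

tok = AutoTokenizer.from_pretrained("gpt2")          # stand-in for a GPT-4o snapshot
model = AutoModelForCausalLM.from_pretrained("gpt2")

def zero_shot_score(passage, category="violence"):
    """Difference between the Yes and No next-token logits at the final position."""
    ids = tok(prefix_suffix_prompt(passage, category), return_tensors="pt")
    with torch.no_grad():
        logits = model(**ids).logits[0, -1]
    yes_id = tok(" Yes", add_special_tokens=False).input_ids[0]
    no_id = tok(" No", add_special_tokens=False).input_ids[0]
    return (logits[yes_id] - logits[no_id]).item()

passage = format_passage([("User", "Describe the scene."),
                          ("Assistant", "It was a calm afternoon in the park.")])
print(zero_shot_score(passage))
```

For prompted probing, the activations from this prompted forward pass, rather than the output logits, would be fed to a linear probe.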
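
A sketch of SAE feature extraction for probing (methods 2 and 3): per-token latents are computed with a JumpReLU nonlinearity in place of the TopK used during SAE training, then max-pooled across tokens to give one feature vector per passage. The weights, thresholds, and shapes below are random placeholders, not the paper's trained SAE.

```python
# Sketch of SAE feature extraction for probing: JumpReLU latents per token,
# max-pooled across the passage. All parameters here are random placeholders.
import numpy as np

class SAEProbeFeatures:
    def __init__(self, W_enc, b_enc, theta):
        # W_enc: (d_model, n_latents); b_enc, theta: (n_latents,); theta = JumpReLU thresholds
        self.W_enc, self.b_enc, self.theta = W_enc, b_enc, theta

    def pre_activations(self, acts):
        # acts: (n_tokens, d_model) residual-stream activations for one passage
        return acts @ self.W_enc + self.b_enc               # (n_tokens, n_latents)

    def latents(self, acts):
        pre = self.pre_activations(acts)
        # JumpReLU: keep the value only where it exceeds the per-latent threshold
        return np.where(pre > self.theta, pre, 0.0)

    def max_pooled(self, acts, use_latents=True):
        feats = self.latents(acts) if use_latents else self.pre_activations(acts)
        return feats.max(axis=0)                             # (n_latents,) per passage

# toy usage with random placeholders
d_model, n_latents, n_tokens = 16, 64, 10
sae = SAEProbeFeatures(W_enc=0.1 * np.random.randn(d_model, n_latents),
                       b_enc=np.zeros(n_latents),
                       theta=np.full(n_latents, 0.05))
x = sae.max_pooled(np.random.randn(n_tokens, d_model))       # probe input vector
```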
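
A sketch of the probe-training recipes (methods 4 and 6): L2-regularized logistic regression with the lbfgs solver for raw activation probes, L1-regularized logistic regression with the saga solver on the top-Q SAE features chosen by a mean-difference score, and a second-level classifier over the probe logit plus the model's Yes/No logit difference. The exact selection rule and hyperparameter defaults are assumptions based on the description above.

```python
# Sketch of the probe-training recipes. The mean-difference selection rule and
# the default hyperparameters are assumptions, not the paper's tuned values.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_raw_activation_probe(X, y, C=1.0):
    # X: (n_examples, d_model) last-token (or pooled) activations
    return LogisticRegression(penalty="l2", solver="lbfgs", C=C, max_iter=1000).fit(X, y)

def select_top_q(X, y, q=1000):
    # Mean-difference score: |mean feature value on positives - mean on negatives|
    score = np.abs(X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0))
    return np.argsort(score)[::-1][:q]

def train_sae_probe(X_sae, y, q=1000, C=1.0):
    # X_sae: (n_examples, n_latents) max-pooled SAE latents or pre-activations
    idx = select_top_q(X_sae, y, q)
    clf = LogisticRegression(penalty="l1", solver="saga", C=C, max_iter=5000)
    return clf.fit(X_sae[:, idx], y), idx

def train_combined_classifier(probe_logit, yes_minus_no_logit, y):
    # Second-level classifier over the probe's output logit and the model's
    # Yes/No logit difference (method 6 above)
    Z = np.column_stack([probe_logit, yes_minus_no_logit])
    return LogisticRegression().fit(Z, y)
```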
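
A sketch of the simplified LAT scan baseline (method 5): the first principal component of differences between randomly paired last-token training activations serves as an unsupervised projection direction, and test activations are scored by projection onto it. The pairing and centering details are assumptions rather than the paper's exact procedure.

```python
# Sketch of the simplified, unsupervised LAT scan baseline. Pairing and
# centering choices below are assumptions, not the paper's exact procedure.
import numpy as np

def lat_scan_direction(train_acts, seed=0):
    # train_acts: (n_examples, d_model) last-token activations, no labels needed
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(train_acts))
    diffs = train_acts - train_acts[perm]            # difference matrix
    diffs = diffs - diffs.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    return vt[0]                                     # first principal component

def lat_scan_scores(test_acts, direction):
    # Sign and decision threshold are chosen downstream (e.g., on a validation set)
    return test_acts @ direction
```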

Key Results and Findings:

  • Prompting is Effective: Prompted last token probing generally demonstrates the best data efficiency, performing strongly in low-data regimes across various tasks (Figure 1). The performance gap between prompted and un-prompted last token probing is most significant when labeled data is scarce.
  • SAE Probing Performance: SAE max-pooled probing performs comparably to prompted probing in high-data regimes. It substantially outperforms raw activation probing when inference-time compute is limited (Figure 1). Max-pooling across tokens is shown to be crucial for SAE probe performance, replicating findings from previous work (2024/features-as-classifiers/; Kantamneni et al., 23 Feb 2025). However, max-pooling hurts performance for raw activation probing (Figure 2).
  • Zero-Shot Baseline: Asking the model zero-shot serves as a strong baseline when inference-time compute is not a constraint. However, with sufficient training data, probing methods can often outperform this baseline, particularly for tasks like hallucination detection where the model's internal state might be more reliable than its direct output (Orgad et al., 3 Oct 2024).
  • Generalization: Prompted probing tends to generalize better than SAE max-pooled probing across distribution shifts (English/non-English, Chat/non-Chat), maintaining higher out-of-distribution performance for a given in-distribution performance (Figure 3). SAE max-pooled probes sometimes showed worse OOD performance compared to naive raw activation probes.
  • Sentiment Analysis Anomaly: On the sentiment task, max-pooled SAE probing underperformed last token probing, suggesting that for some tasks, signal might be concentrated at the final token, and max-pooling across other tokens adds noise. Probing SAE pre-activations or using SAE latents only at the last token showed better results for sentiment (Figure 4).
  • Combining Methods: Combining prompted probing with SAE-based probing or combining prompted probing with the model's direct output does not yield substantial performance improvements over the best individual method (Figures 4.5, 5).
  • Optimizing Methods:
    • Mid-to-late layers (around 75% depth) yield optimal probe performance (Figure 5).
    • Suffix-only prompting performs nearly as well as prefix+suffix prompting and is more computationally efficient, making it the preferred method for parallel monitoring tasks (Figure 6).
    • For SAE probes, performance is relatively insensitive to the number of selected features Q above a certain threshold (Q = 1000 was found to be generally optimal, but performance is competitive even with Q = 10 in some cases) (Figure 7).
  • Comparison with Baselines:
    • Adding few-shot examples to the prompt did not consistently improve zero-shot model output classification and sometimes hurt performance (Figure 8.5).
    • The LAT Scan method, while competitive in some low-data settings, did not outperform the zero-shot model baseline or the best supervised probing methods in high-data regimes (Figure 9). This contrasts with findings in previous work and might be due to the superior capabilities of GPT-4o.
    • Probing SAE pre-activations performed similarly to probing SAE latent activations (Figure 10).

Practical Recommendations:

Based on the results, the authors provide practical guidance:

  1. If inference compute is unlimited: Zero-shot prompting is a strong starting point.
  2. If plentiful training data is available: Use activation probing.
    • Prompted probing is recommended if inference compute is available, due to its data efficiency and good generalization.
    • SAE probing is recommended if training compute is available (for training the SAE) but inference compute is limited (since SAE features can be computed during the main forward pass).
  3. If both inference and training compute are limited: Straightforward raw activation probing is a reasonable choice, with performance scaling with data availability.

The paper also discusses the potential risks of "collusion" in monitoring systems where the monitor and the potentially deceptive model share weights or use additional forward passes that could trigger deceptive behavior. Side-car methods without additional forward passes (like non-prompted probes) might be safer in such scenarios. Limitations include not comparing against fine-tuning the model for the monitoring task (which might be more performant), potential for better zero-shot performance with more careful prompt engineering or use of reasoning tokens, and the need for further work to understand the relationship between signal concentration in activations and probe effectiveness.
