- The paper introduces control vectors added to the residual stream to steer LLM reasoning during inference.
- It employs methods including averaging, contrastive analysis, and PCA to derive effective vectors from model activations.
- Experiments on IOI, bAbI, and GSM8K demonstrate improved accuracy and cross-task generalization in reasoning.
This paper proposes a method to improve the reasoning performance of LLMs by applying interventions directly to their internal representations, specifically within the residual stream. The core idea is to derive "control vectors" from the model's activations when processing examples of successful reasoning and then add these vectors to the activations during inference to encourage a desired reasoning behavior. This approach falls under the domain of representation engineering, treating reasoning ability as a modifiable direction in the model's latent space.
The authors conceptualize the transformer architecture as a process where computational blocks (attention and MLP layers) read from and write to a residual stream. The hidden state vector at layer $\ell$ is denoted $x_\ell$. Their intervention is modeled as a simple addition to the residual stream after the MLP block in layer $\ell$:

$$x_{\ell+1} = \mathrm{LayerNorm}\big(y_\ell + \mathrm{MLP}(y_\ell)\big) + c_\ell \cdot \alpha$$

where $y_\ell$ denotes the activations after the attention mechanism, $c_\ell$ is the layer-specific control vector, and $\alpha$ is a scalar controlling the magnitude and direction of the intervention.
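In PyTorch, this addition can be implemented with a forward hook on the chosen decoder block. The sketch below is a minimal illustration under that assumption; the module path and helper names are mine, not the paper's code (Pythia models expose their blocks as `model.gpt_neox.layers`):

```python
import torch

def make_steering_hook(control_vector: torch.Tensor, alpha: float):
    """Build a forward hook that adds alpha * control_vector to a block's output."""
    def hook(module, inputs, output):
        # HuggingFace decoder blocks typically return a tuple whose first
        # element is the hidden state of shape (batch, seq_len, d_model).
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden.clone()
        # The paper applies the vector at the final token position only.
        hidden[:, -1, :] += alpha * control_vector
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

# Hypothetical usage, intervening on the middle layer:
# layers = model.gpt_neox.layers
# handle = layers[len(layers) // 2].register_forward_hook(
#     make_steering_hook(c_mid, alpha=4.0))
# ... run inference ...
# handle.remove()
```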
The control vectors $c_\ell$ are derived from hidden state activations $H_\ell(P_i)$ obtained by processing a set of training prompts $P$. Three methods for deriving $c_\ell$ are explored (a code sketch of all three follows the list):
- Reading Vector: the average of activations over the prompt set, $c_\ell = \frac{1}{|P|} \sum_{i=1}^{|P|} H_\ell(P_i)$.
- Contrastive Reading Vector: the average difference between activations from positive ($P^+$) and negative ($P^-$) prompt pairs, $c_\ell = \frac{1}{|P^\pm|} \sum_{i=1}^{|P^\pm|} \big( H_\ell(P_i^+) - H_\ell(P_i^-) \big)$. For reasoning tasks, positive examples are prompts the model reasoned through correctly, while negative examples aim to capture representations of poor reasoning (the tested schemes included incorrect model outputs and random character strings).
- PCA Contrastive Vector: the first principal component of the difference vectors $H_\ell(P_i^+) - H_\ell(P_i^-)$, obtained via Principal Component Analysis and rescaled to a norm similar to the average activation norm so that magnitudes remain comparable.
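A minimal sketch of the three schemes, assuming the layer-$\ell$ activations at the final token have already been collected into `(num_prompts, d_model)` tensors; the variable and function names are illustrative, not taken from the paper's released code:

```python
import torch

def reading_vector(acts_pos: torch.Tensor) -> torch.Tensor:
    # Plain average of activations over the prompt set.
    return acts_pos.mean(dim=0)

def contrastive_vector(acts_pos: torch.Tensor, acts_neg: torch.Tensor) -> torch.Tensor:
    # Average difference between paired positive and negative activations.
    return (acts_pos - acts_neg).mean(dim=0)

def pca_contrastive_vector(acts_pos: torch.Tensor, acts_neg: torch.Tensor) -> torch.Tensor:
    # First principal component of the difference vectors, rescaled to roughly
    # the average activation norm so magnitudes stay comparable.
    diffs = acts_pos - acts_neg
    _, _, v = torch.pca_lowrank(diffs, q=1)  # v has shape (d_model, 1)
    return v[:, 0] * acts_pos.norm(dim=1).mean()
```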
The authors evaluate this method on three reasoning tasks:
- Indirect-Object-Identification (IOI): A simple inductive task involving identifying the indirect object in sentences like "Mary and John went to the store. John gave the groceries to Mary."
- bAbI Task 15: A deductive reasoning task requiring chaining facts from a short passage to answer a question.
- GSM8K: A dataset of grade school mathematical word problems requiring multi-step reasoning and calculation.
Experiments are conducted on Pythia-1.4B, Pythia-2.8B, and Mistral-7B-Instruct. Control vectors are derived from examples in a training split of each dataset and applied only to the model's middle layer at the final token position. Performance on a test set is measured with logit-based accuracy: a prediction counts as correct if the correct answer token receives the highest logit among the candidate answers, rather than requiring a strict exact-match generation. The intervention is further analyzed via the KL divergence and entropy of the logit distribution and the average probability of correct versus incorrect tokens as a function of $\alpha$.
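A sketch of the logit-based accuracy check and the distribution metrics, under the assumption that each candidate answer maps to a single token; `candidate_ids`, `correct_idx`, and the function names are illustrative:

```python
import torch

@torch.no_grad()
def logit_based_accuracy(model, input_ids, candidate_ids, correct_idx):
    # Logits at the final position, restricted to the candidate answer tokens.
    logits = model(input_ids).logits[:, -1, :]       # (batch, vocab)
    cand = logits[:, candidate_ids]                  # (batch, num_candidates)
    preds = cand.argmax(dim=-1)                      # highest-logit candidate
    return (preds == correct_idx).float().mean().item()

@torch.no_grad()
def distribution_metrics(logits_steered: torch.Tensor, logits_base: torch.Tensor):
    # Entropy of the steered next-token distribution and KL(base || steered),
    # averaged over the batch; the KL direction here is an assumption.
    logp_s = logits_steered.log_softmax(dim=-1)
    logp_b = logits_base.log_softmax(dim=-1)
    entropy = -(logp_s.exp() * logp_s).sum(dim=-1).mean()
    kl = (logp_b.exp() * (logp_b - logp_s)).sum(dim=-1).mean()
    return entropy.item(), kl.item()
```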
Key findings include:
- Applying control vectors can improve performance on the specified reasoning tasks across different models.
- The optimal scaling factor $\alpha$ varies by model and task, sometimes requiring a negative value (e.g., for GSM8K on Mistral).
- For smaller Pythia models, slight accuracy improvements were observed on the IOI task, with metrics showing the intervention's effect on the logit distribution.
- For Mistral-7B-Instruct, improvements were seen on the bAbI and notably the more complex GSM8K tasks.
- A significant finding is that control vectors derived from one reasoning task (bAbI) can improve performance on a different reasoning task (GSM8K) and vice versa, suggesting the control vector captures a more general "reasoning" related direction in the model's latent space.
- Qualitative examples show that applying the intervention can influence the model's generated reasoning trace in GSM8K, leading to a correct answer where the original trace failed.
From an implementation perspective, the method requires:
- Access to model internals to extract hidden state activations. HuggingFace Transformers can load the models, and PyTorch (or TensorFlow) exposes intermediate layer outputs; a minimal extraction sketch follows this list.
- A dataset of task examples, ideally with corresponding correct outputs. Deriving contrastive vectors additionally requires examples where the model succeeds and examples where it fails. The paper suggests using few-shot examples both when deriving the vectors and at inference time to ground the model.
- Implementation of the control vector calculation logic (averaging, PCA on differences).
- A mechanism to modify the model's forward pass during inference to add the scaled control vector to the residual stream at specified layers and tokens. This might involve custom forward hooks or modifying the model's architecture definition.
- Evaluation scripts to calculate metrics like logit-based accuracy, KL divergence, and entropy.
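As a concrete example of the extraction step above, hidden states can be read out with `output_hidden_states=True` in HuggingFace Transformers. The Pythia checkpoint matches the paper's models, while the helper itself is only a sketch:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("EleutherAI/pythia-1.4b")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-1.4b")

@torch.no_grad()
def final_token_activation(prompt: str, layer: int) -> torch.Tensor:
    ids = tok(prompt, return_tensors="pt").input_ids
    out = model(ids, output_hidden_states=True)
    # hidden_states[0] is the embedding output; hidden_states[layer] is the
    # residual stream after block `layer`. Take the final token's vector.
    return out.hidden_states[layer][0, -1, :]
```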
The authors provide code publicly, which would be essential for reproducing and extending this work.
Practical considerations for applying this method:
- Computational Cost: Extracting activations requires multiple forward passes over the training data. Applying the control vector during inference adds a small computational overhead (vector addition). Deriving control vectors (especially PCA) might require storing and processing large matrices of activations.
- Data Requirements: The effectiveness depends on having a representative set of positive and negative examples to derive the control vector. Defining "unsuccessful reasoning" in practice can be challenging.
- Hyperparameter Tuning: The optimal $\alpha$ must be determined empirically, potentially for each task and model. The choice of which layer(s) to intervene on (the paper focuses on the middle layer) and at which token(s) may also require tuning; a minimal sweep is sketched after this list.
- Robustness: The jagged trend lines in some results (e.g., Mistral on GSM8K) suggest the intervention may be sensitive to the exact $\alpha$ value or to the specific examples used to derive the vector.
- Generalization: While cross-task generalization was observed between bAbI and GSM8K, further research is needed to see how broadly this "reasoning" direction generalizes to other reasoning tasks or domains.
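A simple way to pick $\alpha$ in practice is a validation sweep, reusing the hook helper sketched earlier; `layer`, `c_mid`, `evaluate`, and `val_set` are hypothetical placeholders:

```python
# Sweep a small grid of alpha values and keep the best validation accuracy.
best_alpha, best_acc = 0.0, -1.0
for alpha in (-8, -4, -2, -1, 0, 1, 2, 4, 8):
    handle = layer.register_forward_hook(make_steering_hook(c_mid, float(alpha)))
    acc = evaluate(model, val_set)  # e.g., the logit-based accuracy above
    handle.remove()
    if acc > best_acc:
        best_alpha, best_acc = float(alpha), acc
```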
The paper concludes that reasoning performance can be modulated via representation engineering, suggesting that aspects of reasoning are encoded in the residual stream much like other model characteristics such as sentiment. While the authors acknowledge limitations in the model scales and task complexity studied, the results, particularly on GSM8K and the cross-task effect, are promising for future work on steerable and potentially more reliable LLMs that require no additional training.