- The paper introduces an activation steering method that derives instruction-specific vectors from activation differences to improve instruction-following without updating model weights.
- Experimental results demonstrate enhanced adherence to format, length, and word-specific constraints across various language models.
- The study reveals cross-model transferability: vectors extracted from instruction-tuned models boost instruction-following in the corresponding base models without extra training.
Enhancing Instruction-Following in LLMs via Activation Steering
This paper introduces a mechanistic approach to improving instruction adherence in LLMs using activation steering. The method derives instruction-specific vector representations by computing activation differences between paired inputs with and without an instruction. Adding these vectors to the model's activations at inference time guides the model toward satisfying the constraint without altering model weights, offering modular and scalable control over outputs.
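As a concrete illustration, the sketch below computes such a difference vector with Hugging Face transformers. The model choice, extraction layer, prompt pair, and use of the final-token residual stream are assumptions for illustration, not the paper's exact configuration; in practice the difference would be averaged over many paired prompts for a more stable instruction representation.

```python
# Minimal sketch of steering-vector extraction via activation differences.
# LAYER, the prompt pair, and last-token pooling are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "microsoft/Phi-3-mini-4k-instruct"  # one of the model families studied
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

LAYER = 16  # hypothetical extraction layer

def residual_at(prompt: str) -> torch.Tensor:
    """Residual-stream activation after decoder block LAYER, at the final token."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # hidden_states[0] is the embedding output, so index LAYER + 1 is the
    # residual stream after decoder block LAYER.
    return out.hidden_states[LAYER + 1][0, -1, :]

# Paired inputs: the same query with and without the instruction.
with_instruction = "Answer in lowercase only. What is the capital of France?"
without_instruction = "What is the capital of France?"

steering_vector = residual_at(with_instruction) - residual_at(without_instruction)
```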
Methodology and Experimentation
The paper extracts vector representations of instructions contrastively, as the difference in residual-stream activations between paired inputs with and without the instruction. Added to the model's activations during inference, these vectors enforce constraints on output format, length, and word usage, as sketched below. The approach is empirically tested across four LLMs: Phi-3, Gemma 2 2B, Gemma 2 9B, and Mistral 7B.
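To show how such a vector steers generation, here is a hedged sketch that adds it to the residual stream through a forward hook, continuing from the extraction snippet above. The hook point and the scalar weight `alpha` are assumptions, not the paper's reported settings.

```python
# Sketch of inference-time steering: add the scaled vector to the output of
# one decoder block on every forward pass. `alpha` controls steering strength.
alpha = 4.0  # hypothetical steering weight

def steering_hook(module, inputs, output):
    # Decoder layers return a tuple whose first element is the hidden states;
    # add the steering vector at every sequence position.
    steered = output[0] + alpha * steering_vector.to(output[0].dtype)
    return (steered,) + output[1:]

handle = model.model.layers[LAYER].register_forward_hook(steering_hook)
prompt = "What is the capital of France?"  # no explicit instruction given
ids = tok(prompt, return_tensors="pt")
generated = model.generate(**ids, max_new_tokens=40)
print(tok.decode(generated[0], skip_special_tokens=True))
handle.remove()  # detach the hook to restore unsteered behavior
```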
Key findings indicate that steering vectors improve constraint satisfaction both when the instruction is stated explicitly and when it is omitted. The paper also demonstrates the compositionality of activation steering by combining vectors for multiple instructions, and shows cross-model steering, where vectors computed on instruction-tuned models enhance the performance of the corresponding base models (see the sketch below).
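A minimal sketch of both ideas follows; `v_format`, `v_length`, and the weights are hypothetical placeholders, and the Gemma 2 2B pair is used only because both checkpoints are public.

```python
# Compose independently extracted instruction vectors into one steering
# direction; the vectors and weights here are hypothetical and would be
# tuned per instruction type.
v_combined = 3.0 * v_format + 2.0 * v_length

# Cross-model steering sketch: a vector extracted on an instruction-tuned
# checkpoint (here google/gemma-2-2b-it, using the procedure above) is hooked
# into the matching base model, which shares architecture and hidden size.
base_model = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b")
handle = base_model.model.layers[LAYER].register_forward_hook(steering_hook)
```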
Results and Implications
- Format Instructions: Steering significantly increased adherence to format-related instructions; for instance, Phi-3 showed higher accuracy on instructions such as "output in lowercase" or "write in JSON format".
- Length Instructions: Adjusting the steering weight allowed dynamic control of output length, and the method reliably satisfied sentence-length constraints.
- Word Instructions: Steering improved both word inclusion and exclusion. Notably, subtracting a vector extracted for inclusion worked for exclusion tasks, reducing occurrences of the undesired keyword (see the sketch after this list).
- Cross-Model Steering: Vectors extracted from instruction-tuned models transferred to base models, demonstrating the potential to leverage fine-tuned capabilities without additional training.
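The sign and magnitude of the steering weight are the control knobs behind the length and word results above. A hedged helper, reusing `model`, `tok`, and `LAYER` from the earlier sketches; `v_word` and the weights are hypothetical:

```python
# Generate with a given vector and weight, then detach the hook.
def steered_generate(prompt: str, vector: torch.Tensor, alpha: float) -> str:
    def hook(module, inputs, output):
        return (output[0] + alpha * vector.to(output[0].dtype),) + output[1:]
    handle = model.model.layers[LAYER].register_forward_hook(hook)
    try:
        ids = tok(prompt, return_tensors="pt")
        gen = model.generate(**ids, max_new_tokens=60)
        return tok.decode(gen[0], skip_special_tokens=True)
    finally:
        handle.remove()

# A positive weight pushes a word-inclusion vector toward using the word;
# negating the same vector suppresses it. Scaling a length vector up or down
# modulates response length. `v_word` is a hypothetical inclusion vector.
included = steered_generate("Describe Paris.", v_word, +4.0)
excluded = steered_generate("Describe Paris.", v_word, -4.0)
```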
Theoretical and Practical Implications
Theoretically, the research advances understanding of how LLMs internally represent instructions and opens pathways to more precise control over model outputs, which is crucial for aligning model behavior with user objectives across diverse applications. Practically, steering model responses via activations offers a flexible alternative to data-intensive fine-tuning or retraining, supporting adaptive deployment across varying usage scenarios.
Future Directions
Future research could refine the approach by exploring:
- Impact of varying steering intensities.
- Extension to more complex constraints and steering applied across multiple layers.
- Enhanced cross-model transfer using different fine-tuning strategies.
The paper illustrates a promising technique for building controllable LLMs, indicating that activation steering is a viable approach to improving instruction-following without sacrificing real-time adaptability.