- The paper introduces an activation steering method that derives instruction-specific vectors from activation differences to improve instruction-following without updating model weights.
- Experimental results demonstrate enhanced adherence to format, length, and word-specific constraints across various language models.
- The study reveals cross-model transferability: vectors extracted from instruction-tuned models boost instruction-following in the corresponding base models without extra training.
Enhancing Instruction-Following in LLMs via Activation Steering
This paper introduces a mechanistic approach to improving instruction adherence in LLMs using activation steering. The method derives instruction-specific vector representations by computing activation differences between paired inputs with and without an instruction. Adding these vectors to the model's activations at inference time guides the model toward satisfying the constraint without altering model weights, offering modular and scalable control over outputs.
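As a concrete illustration, the sketch below computes such a difference vector with Hugging Face transformers. The model choice, extraction layer, prompt pair, and use of the final-token residual stream are assumptions for illustration, not the paper's exact configuration; in practice the difference would be averaged over many paired prompts for a more stable instruction representation.

```python
# Minimal sketch of steering-vector extraction via activation differences.
# LAYER, the prompt pair, and last-token pooling are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "microsoft/Phi-3-mini-4k-instruct"  # one of the model families studied
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

LAYER = 16  # hypothetical extraction layer

def residual_at(prompt: str) -> torch.Tensor:
    """Residual-stream activation after decoder block LAYER, at the final token."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # hidden_states[0] is the embedding output, so index LAYER + 1 is the
    # residual stream after decoder block LAYER.
    return out.hidden_states[LAYER + 1][0, -1, :]

# Paired inputs: the same query with and without the instruction.
with_instruction = "Answer in lowercase only. What is the capital of France?"
without_instruction = "What is the capital of France?"

steering_vector = residual_at(with_instruction) - residual_at(without_instruction)
```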
Methodology and Experimentation
The paper extracts vector representations of instructions contrastively, as the difference in residual-stream activations between paired inputs with and without the instruction. Added to the model's activations during inference, these vectors enforce constraints on output format, length, and word usage, as sketched below. The approach is empirically tested across four LLMs: Phi-3, Gemma 2 2B, Gemma 2 9B, and Mistral 7B.
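To show how such a vector steers generation, here is a hedged sketch that adds it to the residual stream through a forward hook, continuing from the extraction snippet above. The hook point and the scalar weight `alpha` are assumptions, not the paper's reported settings.

```python
# Sketch of inference-time steering: add the scaled vector to the output of
# one decoder block on every forward pass. `alpha` controls steering strength.
alpha = 4.0  # hypothetical steering weight

def steering_hook(module, inputs, output):
    # Decoder layers return a tuple whose first element is the hidden states;
    # add the steering vector at every sequence position.
    steered = output[0] + alpha * steering_vector.to(output[0].dtype)
    return (steered,) + output[1:]

handle = model.model.layers[LAYER].register_forward_hook(steering_hook)
prompt = "What is the capital of France?"  # no explicit instruction given
ids = tok(prompt, return_tensors="pt")
generated = model.generate(**ids, max_new_tokens=40)
print(tok.decode(generated[0], skip_special_tokens=True))
handle.remove()  # detach the hook to restore unsteered behavior
```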
Key findings indicate that steering vectors improve constraint satisfaction both when the instruction is stated explicitly and when it is omitted. The paper also demonstrates the compositionality of activation steering by combining vectors for multiple instructions, and shows cross-model steering, where vectors computed on instruction-tuned models enhance the performance of the corresponding base models (see the sketch below).
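A minimal sketch of both ideas follows; `v_format`, `v_length`, and the weights are hypothetical placeholders, and the Gemma 2 2B pair is used only because both checkpoints are public.

```python
# Compose independently extracted instruction vectors into one steering
# direction; the vectors and weights here are hypothetical and would be
# tuned per instruction type.
v_combined = 3.0 * v_format + 2.0 * v_length

# Cross-model steering sketch: a vector extracted on an instruction-tuned
# checkpoint (here google/gemma-2-2b-it, using the procedure above) is hooked
# into the matching base model, which shares architecture and hidden size.
base_model = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b")
handle = base_model.model.layers[LAYER].register_forward_hook(steering_hook)
```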
Results and Implications
- Format Instructions: Steering significantly increased adherence to format-related instructions; for instance, Phi-3 showed higher accuracy on instructions such as "output in lowercase" or "write in JSON format".
- Length Instructions: Adjusting the steering weight allowed dynamic control of output length, and the method reliably satisfied sentence-length constraints.
- Word Instructions: Steering improved both word inclusion and exclusion. Notably, subtracting a vector extracted for inclusion worked for exclusion tasks, reducing occurrences of the undesired keyword (see the sketch after this list).
- Cross-Model Steering: Vectors extracted from instruction-tuned models transferred to base models, demonstrating the potential to leverage fine-tuned capabilities without additional training.
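The sign and magnitude of the steering weight are the control knobs behind the length and word results above. A hedged helper, reusing `model`, `tok`, and `LAYER` from the earlier sketches; `v_word` and the weights are hypothetical:

```python
# Generate with a given vector and weight, then detach the hook.
def steered_generate(prompt: str, vector: torch.Tensor, alpha: float) -> str:
    def hook(module, inputs, output):
        return (output[0] + alpha * vector.to(output[0].dtype),) + output[1:]
    handle = model.model.layers[LAYER].register_forward_hook(hook)
    try:
        ids = tok(prompt, return_tensors="pt")
        gen = model.generate(**ids, max_new_tokens=60)
        return tok.decode(gen[0], skip_special_tokens=True)
    finally:
        handle.remove()

# A positive weight pushes a word-inclusion vector toward using the word;
# negating the same vector suppresses it. Scaling a length vector up or down
# modulates response length. `v_word` is a hypothetical inclusion vector.
included = steered_generate("Describe Paris.", v_word, +4.0)
excluded = steered_generate("Describe Paris.", v_word, -4.0)
```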
Theoretical and Practical Implications
Theoretically, the research advances understanding of how LLMs internally represent instructions and opens pathways to more precise control over model outputs, which is crucial for aligning model behavior with user objectives across diverse applications. Practically, steering model responses via activations offers a flexible alternative to data-intensive fine-tuning or retraining, supporting adaptive deployment across varying usage scenarios.
Future Directions
Future research could refine the approach by exploring:
- Impact of varying steering intensities.
- Extension to more complex constraints and steering applied across multiple layers.
- Enhanced cross-model transfer using different fine-tuning strategies.
The paper illustrates a promising technique for building controllable LLMs, indicating that activation steering is a viable approach to improving instruction-following without sacrificing real-time adaptability.