- The paper presents Dynamic Activation Composition, using adaptive KL divergence to modulate steering intensity for effective multi-property control.
- It evaluates fixed, diminishing, and dynamic steering strategies, demonstrating that the dynamic approach best balances fluency with sustained conditioning.
- The approach enhances LLM controllability and interpretability, paving the way for scalable, robust deployments in varied real-world applications.
Multi-property Steering of LLMs with Dynamic Activation Composition
Activation steering methods have demonstrated their effectiveness in conditioning the output of LLMs by manipulating intermediate model representations at inference time. Despite their potential, prior evaluations were constrained to single-property conditioning and synthetic scenarios. The paper "Multi-property Steering of LLMs with Dynamic Activation Composition" by Daniel Scalena, Gabriele Sarti, and Malvina Nissim extends this line of research by offering a comprehensive evaluation of activation steering methods and introducing a new approach named Dynamic Activation Composition (Dyn). The approach aims to ensure robust conditioning across multiple properties during generation without sacrificing fluency.
Methodology and Experimental Setup
Dynamic Activation Composition: The primary innovation introduced by the authors is Dynamic Activation Composition, an information-theoretic approach that adjusts steering intensity at each generation step. Unlike the fixed or diminishing strategies of prior work, it leverages the Kullback-Leibler (KL) divergence between the next-token probability distributions of the original and steered models to modulate the strength of the intervention dynamically. This adaptive scheme balances the intensity of conditioning throughout generation, thereby maintaining high fluency.
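The mapping from divergence to intensity below is a minimal sketch, not the paper's exact rule: it assumes the per-step coefficient is simply the KL divergence between the steered and unsteered next-token distributions, clipped to a maximum value (`alpha_max`, the function name, and the clipping rule are illustrative assumptions).

```python
import torch
import torch.nn.functional as F

def dynamic_alpha(logits_plain: torch.Tensor,
                  logits_steered: torch.Tensor,
                  alpha_max: float = 2.0) -> float:
    """Illustrative per-step steering coefficient.

    A large divergence between the steered and unsteered next-token
    distributions means the model has not yet adopted the target property,
    so a strong intervention is kept; a small divergence means conditioning
    is already in effect, so the intervention is relaxed to preserve fluency.
    """
    log_p_plain = F.log_softmax(logits_plain, dim=-1)
    log_p_steer = F.log_softmax(logits_steered, dim=-1)
    p_steer = log_p_steer.exp()
    # KL(p_steered || p_plain), summed over the vocabulary.
    kl = torch.sum(p_steer * (log_p_steer - log_p_plain))
    # Bound the coefficient (the exact divergence-to-alpha mapping is an assumption).
    return min(alpha_max, kl.item())
```

In a generation loop, `logits_steered` would come from a forward pass with the steering vector injected at full strength; the returned coefficient then scales the injection actually used to sample the next token.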
Activation Extraction and Injection:
- Activation Extraction: Contrastive prompt pairs exhibiting opposite realizations of a property are used to derive steering vectors from the LLM’s intermediate activations.
- Activation Injection: The derived steering vectors are added to model activations during generation, scaled by a factor α. Fixed, diminishing, and dynamic strategies are evaluated to determine the optimal steering intensity for each property (language, safety, formality); a minimal code sketch of both steps follows this list.
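As an illustration of the two steps, the following is a minimal sketch for a Hugging Face-style causal LM; the choice of hooking the hidden states, the layer index, and the decay rule of the diminishing schedule are assumptions rather than the paper's exact configuration.

```python
import torch

def steering_vector(model, tokenizer, pos_prompts, neg_prompts, layer: int) -> torch.Tensor:
    """Mean difference between last-token activations of contrastive prompts
    (property present vs. property absent) at a chosen layer."""
    def mean_act(prompts):
        acts = []
        for text in prompts:
            ids = tokenizer(text, return_tensors="pt").input_ids
            with torch.no_grad():
                out = model(ids, output_hidden_states=True)
            acts.append(out.hidden_states[layer][0, -1])
        return torch.stack(acts).mean(dim=0)
    return mean_act(pos_prompts) - mean_act(neg_prompts)

def alpha_schedule(step: int, strategy: str, alpha0: float = 2.0) -> float:
    """Scaling factor α at a given generation step (decay rule is illustrative)."""
    if strategy == "start":
        return alpha0 if step == 0 else 0.0  # steer only at the very start
    if strategy == "fixed":
        return alpha0                        # constant intensity
    if strategy == "dim":
        return alpha0 / (step + 1)           # diminishing intensity
    raise ValueError(f"unknown strategy: {strategy}")

def inject(hidden: torch.Tensor, steer_vec: torch.Tensor, alpha: float) -> torch.Tensor:
    """Add the scaled steering vector to the current token's hidden state."""
    hidden = hidden.clone()
    hidden[:, -1, :] += alpha * steer_vec
    return hidden
```

In practice, `inject` would be applied through a forward hook on the chosen layer at every decoding step, with `alpha_schedule` (or the dynamic rule sketched above) supplying the coefficient.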
Datasets:
- Alpaca: instruction prompts used for language conditioning, translated into Italian, French, Spanish, and Chinese to provide a multilingual benchmark.
- BeaverTails: used to steer and evaluate the safety of model generations.
- GYAFC and XFORMAL: used to steer and assess the formality of generated text across multiple languages.
Results
Single-property Steering:
- The Start strategy, which applies steering only at the beginning of generation, proved insufficient for sustained conditioning, especially for language steering.
- The Fixed and Dim (diminishing) strategies were better at maintaining conditioning throughout generation. Notably, high α values produced disfluent generations, highlighting a trade-off between conditioning accuracy and fluency.
- Dynamic Activation Composition emerged as the most effective strategy, particularly for properties requiring consistent steering intensity.
Multi-property Steering:
- In multi-property settings, the Dyn approach delivered the best trade-off between achieving multiple conditioning objectives and maintaining generation fluency, showing that per-step dynamic adjustment can accommodate the demands of steering several properties simultaneously (a sketch of this composition follows below).
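How the per-property interventions might be composed at a single step can be sketched as follows, assuming one steering vector and one per-step coefficient per property (e.g., language and safety); the additive composition and helper name are illustrative, not a verbatim reproduction of the paper's implementation.

```python
import torch

def composed_offset(steer_vecs: dict[str, torch.Tensor],
                    alphas: dict[str, float]) -> torch.Tensor:
    """Sum the per-property steering vectors, each scaled by its own per-step α."""
    props = list(steer_vecs)
    offset = torch.zeros_like(steer_vecs[props[0]])
    for prop in props:
        offset += alphas[prop] * steer_vecs[prop]
    return offset
```

Because each coefficient is recomputed at every step, a property whose conditioning is already in effect contributes only weakly, which helps keep the combined intervention from compounding into disfluency.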
Implications and Future Directions
The findings from this paper have significant theoretical and practical implications:
- Enhanced Controllability: Dyn provides a robust mechanism for simultaneously steering multiple properties, thereby increasing the utility and safety of deploying LLMs in real-world applications.
- Scalability: This dynamic approach minimizes the need for manual tuning of steering intensities, making it a scalable solution for diverse applications requiring complex conditional generation.
- Model Interpretability: The method aids in understanding how different properties are encoded within LLMs, fostering greater transparency in model behavior.
Future work could extend these techniques to larger and more diverse LLMs, examining how activation steering methods scale with model size and training paradigms (e.g., instruction-tuned vs. RLHF-aligned models). Additionally, coupling human evaluation with automated metrics could provide a more holistic assessment of fluency and appropriateness in complex steering scenarios.
Conclusion
The research on Dynamic Activation Composition introduces a novel and effective methodology for multi-property steering in LLMs, addressing the limitations of prior fixed and diminishing strategies. By dynamically modulating steering intensity, this approach maintains robust property conditioning and fluency in generated text, paving the way for more controllable and interpretable LLM deployments.