Understanding contributions in composite steering of large language models

Characterize the contribution of each individual steering intervention to the final generated output when multiple steering controls are composed within a single large language model inference run. In particular, model the non-linear interactions among controls acting on the input, structural, state, and output surfaces (e.g., activation addition, post-hoc attention steering, fine-tuning, and decoding-time alignment) to enable reliable attribution and principled analysis of ordering effects.
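
Because the interactions are non-linear, per-control attributions cannot simply be read off from single-control runs. One standard way to account for interactions is a Shapley-style decomposition over subsets of active controls. The sketch below is a minimal illustration, not the paper's method: the control names and the behavior_score function (including its interaction term) are hypothetical stand-ins for scores one would measure by running the steered model on an evaluation set.

```python
from itertools import combinations
from math import factorial

def behavior_score(active: frozenset) -> float:
    """Toy stand-in for a measured behavior metric under a subset of
    active controls; the interaction term makes the game non-additive."""
    base = {"act_add": 0.30, "attn_steer": 0.15, "decode_align": 0.25}
    score = sum(base[k] for k in active)
    if {"act_add", "decode_align"} <= active:
        score -= 0.20  # illustrative non-linear interaction between two controls
    return score

def shapley(players: list, value) -> dict:
    """Exact Shapley values: each control's average marginal contribution
    over all orderings, which fairly splits non-additive interactions."""
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for r in range(len(others) + 1):
            for subset in combinations(others, r):
                s = frozenset(subset)
                weight = factorial(r) * factorial(n - r - 1) / factorial(n)
                phi[p] += weight * (value(s | {p}) - value(s))
    return phi

controls = ["act_add", "attn_steer", "decode_align"]
print(shapley(controls, behavior_score))
```

The exact computation is exponential in the number of controls; for larger compositions one would subsample subsets or permutations (Monte Carlo Shapley estimation).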

Background

The paper introduces steering pipelines that can compose multiple controls across different model control surfaces (input, structural, state, and output). While such compositions are increasingly studied, the interactions between controls are often non-linear, making it difficult to attribute observed behaviors to specific interventions.
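
To make the four control surfaces concrete, the sketch below wires independent controls into a single inference pass. The SteeringPipeline class and the toy forward/head functions are assumptions for illustration only; a real pipeline would wrap a model's prompt construction, hidden states, and decoder, and structural controls such as fine-tuning would modify the weights behind forward itself (omitted here).

```python
from dataclasses import dataclass, field
from typing import Callable, List

Prompt, State, Logits = str, List[float], List[float]

@dataclass
class SteeringPipeline:
    # Controls grouped by the surface they act on (hypothetical interface).
    input_controls: List[Callable[[Prompt], Prompt]] = field(default_factory=list)
    state_controls: List[Callable[[State], State]] = field(default_factory=list)
    output_controls: List[Callable[[Logits], Logits]] = field(default_factory=list)

    def run(self, prompt: Prompt,
            forward: Callable[[Prompt], State],
            head: Callable[[State], Logits]) -> Logits:
        for c in self.input_controls:    # input surface, e.g. prompt edits
            prompt = c(prompt)
        state = forward(prompt)
        for c in self.state_controls:    # state surface, e.g. activation addition
            state = c(state)
        logits = head(state)
        for c in self.output_controls:   # output surface, e.g. decoding-time alignment
            logits = c(logits)
        return logits

# Toy "model": embed a prompt as two features, then score two tokens.
forward = lambda p: [len(p) / 10.0, p.count("e") / 5.0]
head = lambda h: [h[0] + h[1], h[0] - h[1]]

pipe = SteeringPipeline(
    input_controls=[lambda p: p + " (be concise)"],
    state_controls=[lambda h: [x + 0.1 for x in h]],
    output_controls=[lambda z: [z[0], z[1] - 0.5]],
)
print(pipe.run("steer me", forward, head))
```

Even in this toy setting, the final logits depend jointly on all three stages, which is exactly what makes attribution to any single control non-trivial.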

The authors specifically highlight that the individual impact of each composed method, and how composition order affects outcomes, are not well understood. The toolkit enables experimentation with such compositions, but a theoretical and empirical characterization remains open.
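
Ordering effects are easiest to see when two controls act on the same surface and their maps do not commute. The sketch below is a hypothetical example (both controls and all values are made up): an activation addition followed by clipping yields a different state than clipping followed by the addition.

```python
from itertools import permutations

def act_add(h):   # shift along a fixed, illustrative steering direction
    return [x + 0.5 for x in h]

def clamp(h):     # illustrative post-hoc control: clip activations to [-1, 1]
    return [max(-1.0, min(1.0, x)) for x in h]

h0 = [0.8, -0.2]
for order in permutations([act_add, clamp]):
    h = h0
    for f in order:
        h = f(h)
    print(" -> ".join(f.__name__ for f in order), "gives", h)
# act_add -> clamp gives [1.0, 0.3]; clamp -> act_add gives [1.3, 0.3]
```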

References

In general, the contribution of each intervention to the final output is not well understood, largely due to non-linear interactions.

AI Steerability 360: A Toolkit for Steering Large Language Models (2603.07837 - Miehling et al., 8 Mar 2026) in Section: Additional toolkit features, paragraph “Composite steering”.