Mechanisms underlying instruction-tuning’s degradation of in-context steerability
Determine why instruction-tuned large language models exhibit reduced in-context steerability compared to pretrained models by rigorously testing the hypothesized mechanisms (strong instruction-induced priors that resist override by in-context examples, over-optimization on tasks with a single ground truth, and overfitting to particular benchmarks) and quantifying their relative contributions.
What explains this difference? While we leave an in-depth exploration of this phenomenon to future work, we hypothesize that it stems from some combination of (1) instruction-tuning inducing strong priors that are difficult to override even with in-context demonstrations, (2) over-optimization on tasks with a single ground truth, and (3) overfitting to particular benchmarks.
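To make hypothesis (1) testable, one natural probe is a flipped-label in-context protocol: demonstrations carry deliberately inverted labels, and we check whether the model follows the in-context mapping or falls back on its prior. If the instruction-tuned checkpoint keeps predicting the prior-consistent label while its pretrained counterpart tracks the demonstrations, that is evidence of an instruction-induced prior resisting in-context override. The sketch below is a minimal illustration of this idea, not the protocol from any particular study; the checkpoint pair, prompt template, and scoring helper are all assumptions made for demonstration.

```python
# Minimal sketch of a flipped-label steerability probe.
# Assumptions: HuggingFace causal LMs; the Qwen2.5-0.5B base/instruct pair is
# an illustrative checkpoint choice; the prompt format is invented here.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def label_logprob(model, tokenizer, prompt: str, label: str) -> float:
    """Sum the log-probability of the `label` tokens conditioned on `prompt`.

    Tokenizing prompt and prompt+label separately can misalign at the
    boundary for some tokenizers; a leading space in `label` mitigates
    this for BPE vocabularies, but this remains an approximation.
    """
    enc = tokenizer(prompt + label, return_tensors="pt")
    prompt_len = tokenizer(prompt, return_tensors="pt")["input_ids"].shape[1]
    with torch.no_grad():
        logits = model(**enc).logits
    # logits at position i predict token i+1, so shift by one.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    label_ids = enc["input_ids"][0, prompt_len:]
    positions = range(prompt_len - 1, enc["input_ids"].shape[1] - 1)
    return sum(log_probs[p, t].item() for p, t in zip(positions, label_ids))


def follows_flipped_demos(model, tokenizer, demos, query) -> bool:
    """Demos carry deliberately flipped sentiment labels. Returns True if
    the model scores the flipped (in-context) label above the
    prior-consistent one, i.e. the demonstrations override the prior."""
    prompt = "".join(f"Review: {x}\nSentiment:{y}\n\n" for x, y in demos)
    prompt += f"Review: {query}\nSentiment:"
    in_context = label_logprob(model, tokenizer, prompt, " negative")  # flipped
    prior = label_logprob(model, tokenizer, prompt, " positive")       # true
    return in_context > prior


if __name__ == "__main__":
    demos = [
        ("I loved every minute of it.", " negative"),  # flipped label
        ("A dull, tedious slog.", " positive"),        # flipped label
    ]
    query = "An absolute delight from start to finish."  # prior: positive
    # Compare a pretrained base model against its instruction-tuned variant.
    for name in ["Qwen/Qwen2.5-0.5B", "Qwen/Qwen2.5-0.5B-Instruct"]:
        tokenizer = AutoTokenizer.from_pretrained(name)
        model = AutoModelForCausalLM.from_pretrained(name).eval()
        follows = follows_flipped_demos(model, tokenizer, demos, query)
        print(f"{name}: follows flipped demos = {follows}")
```

Scoring the two candidate labels by log-probability, rather than sampling a generation, keeps the comparison deterministic and avoids decoding noise; aggregating this override rate over many queries and demonstration counts would give one quantitative handle on mechanism (1), though it does not by itself separate it from mechanisms (2) and (3).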