Mechanisms underlying instruction-tuning’s degradation of in-context steerability
Determine why instruction-tuned large language models exhibit reduced in-context steerability compared to pretrained models by rigorously testing the hypothesized mechanisms (strong instruction-induced priors that resist override by in-context examples, over-optimization on tasks with a single ground truth, and overfitting to particular benchmarks) and quantifying their relative contributions.
What explains this difference? While we leave an in-depth exploration of this phenomenon to future work, we hypothesize that it stems from some combination of (1) instruction-tuning inducing strong priors that are difficult to override even with in-context demonstrations, (2) over-optimization on tasks with a single ground truth, and (3) overfitting to particular benchmarks.
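To make hypothesis (1) testable, one natural probe is a flipped-label in-context protocol: demonstrations carry deliberately inverted labels, and we check whether the model follows the in-context mapping or falls back on its prior. If the instruction-tuned checkpoint keeps predicting the prior-consistent label while its pretrained counterpart tracks the demonstrations, that is evidence of an instruction-induced prior resisting in-context override. The sketch below is a minimal illustration of this idea, not the protocol from any particular study; the checkpoint pair, prompt template, and scoring helper are all assumptions made for demonstration.

```python
# Minimal sketch of a flipped-label steerability probe.
# Assumptions: HuggingFace causal LMs; the Qwen2.5-0.5B base/instruct pair is
# an illustrative checkpoint choice; the prompt format is invented here.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def label_logprob(model, tokenizer, prompt: str, label: str) -> float:
    """Sum the log-probability of the `label` tokens conditioned on `prompt`.

    Tokenizing prompt and prompt+label separately can misalign at the
    boundary for some tokenizers; a leading space in `label` mitigates
    this for BPE vocabularies, but this remains an approximation.
    """
    enc = tokenizer(prompt + label, return_tensors="pt")
    prompt_len = tokenizer(prompt, return_tensors="pt")["input_ids"].shape[1]
    with torch.no_grad():
        logits = model(**enc).logits
    # logits at position i predict token i+1, so shift by one.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    label_ids = enc["input_ids"][0, prompt_len:]
    positions = range(prompt_len - 1, enc["input_ids"].shape[1] - 1)
    return sum(log_probs[p, t].item() for p, t in zip(positions, label_ids))


def follows_flipped_demos(model, tokenizer, demos, query) -> bool:
    """Demos carry deliberately flipped sentiment labels. Returns True if
    the model scores the flipped (in-context) label above the
    prior-consistent one, i.e. the demonstrations override the prior."""
    prompt = "".join(f"Review: {x}\nSentiment:{y}\n\n" for x, y in demos)
    prompt += f"Review: {query}\nSentiment:"
    in_context = label_logprob(model, tokenizer, prompt, " negative")  # flipped
    prior = label_logprob(model, tokenizer, prompt, " positive")       # true
    return in_context > prior


if __name__ == "__main__":
    demos = [
        ("I loved every minute of it.", " negative"),  # flipped label
        ("A dull, tedious slog.", " positive"),        # flipped label
    ]
    query = "An absolute delight from start to finish."  # prior: positive
    # Compare a pretrained base model against its instruction-tuned variant.
    for name in ["Qwen/Qwen2.5-0.5B", "Qwen/Qwen2.5-0.5B-Instruct"]:
        tokenizer = AutoTokenizer.from_pretrained(name)
        model = AutoModelForCausalLM.from_pretrained(name).eval()
        follows = follows_flipped_demos(model, tokenizer, demos, query)
        print(f"{name}: follows flipped demos = {follows}")
```

Scoring the two candidate labels by log-probability, rather than sampling a generation, keeps the comparison deterministic and avoids decoding noise; aggregating this override rate over many queries and demonstration counts would give one quantitative handle on mechanism (1), though it does not by itself separate it from mechanisms (2) and (3).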