Applying Scaled DIT Adapters to Base Models for Self-Analysis

Determine whether Diff Interpretation Tuning (DIT) adapters, when scaled up, can be applied directly to the un-finetuned base language model M to answer self-referential questions about the model’s own behaviors, such as identifying which behaviors its creators would find most concerning.

Background

The paper shows that DIT adapters trained on one type of weight diff generalize poorly to diffs encoding qualitatively different behaviors, suggesting that scaling up and diversifying the training distribution may be needed to improve generalization.

If scaling resolves these generalization issues, a key follow-up question is whether the same adapter could be applied to the base model M itself, enabling the model to introspect on and report its own behaviors without requiring a weight diff produced by finetuning. A minimal sketch of such an experiment is given below.
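The sketch below illustrates one way this could look, assuming the scaled-up DIT adapter is a LoRA-style adapter loadable with Hugging Face PEFT; the model and adapter identifiers are placeholders, not artifacts released by the paper.

```python
# Hypothetical sketch: attach a trained DIT adapter to the *base* model M
# (rather than to a finetuned checkpoint) and ask a self-referential question.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_MODEL = "base-model-M"          # placeholder for the un-finetuned base model M
DIT_ADAPTER = "path/to/dit-adapter"  # placeholder for a scaled-up DIT adapter

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
base = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

# Apply the DIT adapter directly to M, with no finetuning-induced weight diff present.
model = PeftModel.from_pretrained(base, DIT_ADAPTER)

prompt = "Which of your behaviors would your creators find most concerning?"
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Whether the resulting answers would constitute faithful self-reports, rather than plausible-sounding confabulations, is exactly the open question posed here.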

References

"Another interesting open question is whether scaled-up DIT adapters could be applied to the base model M to answer interesting questions about itself (e.g. 'Which of your behaviors would your creators find most concerning?')."

Learning to Interpret Weight Differences in Language Models (2510.05092 - Goel et al., 6 Oct 2025) in Section 6.1 (Generalization to weight diffs encoding different behaviors)