Applying Scaled DIT Adapters to Base Models for Self-Analysis
Determine whether Diff Interpretation Tuning (DIT) adapters, when scaled up, can be applied directly to the un-finetuned base language model $M$ to answer self-referential questions about the model's own behaviors, such as identifying which behaviors its creators would find most concerning.
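A minimal sketch of what this experiment might look like, assuming the scaled-up DIT adapter is stored in a PEFT-compatible LoRA format: load the un-finetuned base model $M$, attach the DIT adapter directly, and pose a self-referential question. The model identifier and adapter path below are hypothetical placeholders, not artifacts from the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Hypothetical placeholders: stand-ins for the base model M and a
# scaled-up DIT adapter saved in PEFT-compatible LoRA format.
BASE_MODEL = "org/base-model-m"
DIT_ADAPTER = "path/to/scaled_dit_adapter"

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
base = AutoModelForCausalLM.from_pretrained(BASE_MODEL, torch_dtype=torch.bfloat16)

# Key step of the proposed experiment: apply the DIT adapter directly to the
# un-finetuned base model M, rather than to a finetuned model whose weight
# diff the adapter was trained to describe.
model = PeftModel.from_pretrained(base, DIT_ADAPTER)

# Pose a self-referential question about the base model's own behavior.
question = "Which of your behaviors would your creators find most concerning?"
inputs = tokenizer(question, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=256, do_sample=False)

# Decode only the newly generated tokens.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```

The open question is whether, at scale, such an adapter would elicit accurate self-descriptions even though the base model carries no finetuning-induced weight diff for the adapter to decode.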
References
Another interesting open question is whether scaled-up DIT adapters could be applied to the base model $M$ to answer interesting questions about itself (e.g. "Which of your behaviors would your creators find most concerning?").
— Goel et al., "Learning to Interpret Weight Differences in Language Models" (arXiv:2510.05092, 6 Oct 2025), Section 6.1 (Generalization to weight diffs encoding different behaviors)