Applying Scaled DIT Adapters to Base Models for Self-Analysis
Determine whether Diff Interpretation Tuning (DIT) adapters, when scaled up, can be applied directly to the un-finetuned base language model $M$ to answer self-referential questions about the model's own behaviors, such as identifying which behaviors its creators would find most concerning.
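A minimal sketch of what this experiment might look like, assuming the scaled-up DIT adapter is stored in a PEFT-compatible LoRA format: load the un-finetuned base model $M$, attach the DIT adapter directly, and pose a self-referential question. The model identifier and adapter path below are hypothetical placeholders, not artifacts from the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Hypothetical placeholders: stand-ins for the base model M and a
# scaled-up DIT adapter saved in PEFT-compatible LoRA format.
BASE_MODEL = "org/base-model-m"
DIT_ADAPTER = "path/to/scaled_dit_adapter"

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
base = AutoModelForCausalLM.from_pretrained(BASE_MODEL, torch_dtype=torch.bfloat16)

# Key step of the proposed experiment: apply the DIT adapter directly to the
# un-finetuned base model M, rather than to a finetuned model whose weight
# diff the adapter was trained to describe.
model = PeftModel.from_pretrained(base, DIT_ADAPTER)

# Pose a self-referential question about the base model's own behavior.
question = "Which of your behaviors would your creators find most concerning?"
inputs = tokenizer(question, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=256, do_sample=False)

# Decode only the newly generated tokens.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```

The open question is whether, at scale, such an adapter would elicit accurate self-descriptions even though the base model carries no finetuning-induced weight diff for the adapter to decode.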
References
Another interesting open question is whether scaled-up DIT adapters could be applied to the base model $M$ to answer interesting questions about itself (e.g. "Which of your behaviors would your creators find most concerning?").
— Goel et al., "Learning to Interpret Weight Differences in Language Models" (arXiv:2510.05092, 6 Oct 2025), Section 6.1 (Generalization to weight diffs encoding different behaviors)