Mechanism of DIT: Weights vs. Activations
Determine whether Diff Interpretation Tuning (DIT) enables language models to describe finetuning-induced modifications primarily by acting on weight parameters or on internal activations. Specifically, does the learned DIT adapter map an applied weight difference to a natural-language description through weight-space interactions, or does it instead interpret the activation-space representations that the weight difference induces?
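Since any influence of the weight diff on the DIT adapter's output is mediated by the forward pass, one illustrative probe is to scale the applied diff and observe how the generated self-description changes. Below is a minimal sketch of such a probe, not the authors' experimental code: the checkpoint name, the saved `weight_diff.pt` artifact, and the prompt are all assumptions, and the DIT adapter is assumed to already be loaded into `model`.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "base-model-checkpoint"  # hypothetical checkpoint name
tok = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE)  # assumed to include the DIT adapter

# weight_diff maps parameter names to delta tensors (e.g., a merged LoRA
# update); "weight_diff.pt" is a hypothetical artifact, not a released file.
weight_diff = torch.load("weight_diff.pt")

def apply_diff(model, diff, scale=1.0):
    """Add `scale * delta` to each matching named parameter, in place."""
    with torch.no_grad():
        for name, param in model.named_parameters():
            if name in diff:
                param.add_(scale * diff[name].to(param.dtype))

def describe(model, prompt="Describe how this model was finetuned:"):
    """Ask the DIT-equipped model to verbalize its own modification."""
    inputs = tok(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=64)
    return tok.decode(out[0], skip_special_tokens=True)

# Sweep the diff's magnitude. If description quality tracks `scale`
# smoothly, the adapter plausibly reads graded activation-space signals;
# an all-or-nothing pattern would be harder to square with that account.
for scale in (0.0, 0.25, 0.5, 1.0):
    apply_diff(model, weight_diff, scale=scale)
    print(scale, describe(model))
    apply_diff(model, weight_diff, scale=-scale)  # restore base weights
```

A sharper follow-up would patch diff-induced activations into the unmodified model (or patch base activations into the modified one), directly dissociating the weight values from their activation-space effects.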
References
A better understanding of whether DIT primarily acts on weights or activations is an open question we are excited about (cf. the paper's section "How does introspection work").
— Learning to Interpret Weight Differences in Language Models (arXiv:2510.05092, Goel et al., 6 Oct 2025), Related work, "Interpreting model activations" paragraph.