Mechanism of DIT: Weights vs. Activations
Determine whether Diff Interpretation Tuning (DIT) enables language models to describe finetuning-induced modifications primarily by acting on weight parameters or on internal activations. Specifically, does the learned DIT adapter map an applied weight difference to a natural-language description through weight-space interactions, or does it instead interpret the activation-space representations that the weight difference induces?
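Since any influence of the weight diff on the DIT adapter's output is mediated by the forward pass, one illustrative probe is to scale the applied diff and observe how the generated self-description changes. Below is a minimal sketch of such a probe, not the authors' experimental code: the checkpoint name, the saved `weight_diff.pt` artifact, and the prompt are all assumptions, and the DIT adapter is assumed to already be loaded into `model`.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "base-model-checkpoint"  # hypothetical checkpoint name
tok = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE)  # assumed to include the DIT adapter

# weight_diff maps parameter names to delta tensors (e.g., a merged LoRA
# update); "weight_diff.pt" is a hypothetical artifact, not a released file.
weight_diff = torch.load("weight_diff.pt")

def apply_diff(model, diff, scale=1.0):
    """Add `scale * delta` to each matching named parameter, in place."""
    with torch.no_grad():
        for name, param in model.named_parameters():
            if name in diff:
                param.add_(scale * diff[name].to(param.dtype))

def describe(model, prompt="Describe how this model was finetuned:"):
    """Ask the DIT-equipped model to verbalize its own modification."""
    inputs = tok(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=64)
    return tok.decode(out[0], skip_special_tokens=True)

# Sweep the diff's magnitude. If description quality tracks `scale`
# smoothly, the adapter plausibly reads graded activation-space signals;
# an all-or-nothing pattern would be harder to square with that account.
for scale in (0.0, 0.25, 0.5, 1.0):
    apply_diff(model, weight_diff, scale=scale)
    print(scale, describe(model))
    apply_diff(model, weight_diff, scale=-scale)  # restore base weights
```

A sharper follow-up would patch diff-induced activations into the unmodified model (or patch base activations into the modified one), directly dissociating the weight values from their activation-space effects.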
References
A better understanding of whether DIT primarily acts on weights or activations is an open question we are excited about (cf. the paper's section "How does introspection work").
— Learning to Interpret Weight Differences in Language Models (arXiv:2510.05092, Goel et al., 6 Oct 2025), Related work, "Interpreting model activations" paragraph.