Extend sparse attention decomposition to model diffing

Develop sparse decomposition techniques for transformer attention mechanisms that enable interpretable comparison of attention computations and parameters between two language models, thereby characterizing the differences learned during fine-tuning. Specifically, extend existing low-rank sparse attention decomposition methods to the comparative setting required for model diffing so that attention changes between a base model and its fine-tuned variant can be analyzed analogously to differences captured by transcoder adapters in MLPs.

Background

Transcoder adapters in this work provide a sparse, interpretable account of changes in MLP computation between a base and fine-tuned reasoning model, but they do not address changes to non-MLP parameters such as attention and embeddings. Via a hybrid baseline, the authors confirm that attention and embedding changes alone are insufficient to reproduce reasoning behavior, yet those attention differences remain unexplained by their method.

Recent research has begun to decompose attention using sparse methods (e.g., low-rank sparse attention decomposition), suggesting a pathway to interpret attention mechanisms. However, these approaches have not yet been extended to the comparative setting of model diffing, where the goal is to isolate and interpret differences in attention between two models. The authors explicitly note this gap as an open question and highlight it as a key limitation to be addressed in future work.

References

While recent work has begun decomposing attention using sparse methods [He 2025], extending this to study differences between models remains an open question.

Transcoder Adapters for Reasoning-Model Diffing  (2602.20904 - Hu et al., 24 Feb 2026) in Conclusion, Limitations and Future Work