Disentangling FFN computation from intermediate activations without harming trainability

Establish whether feed-forward network (FFN) computation in dense transformer-based language models can be disentangled from intermediate activations (including residual-stream features and self-attention outputs) without significantly degrading model trainability.

Background

The paper reviews prior analyses that interpret transformer feed-forward networks (FFNs) as key–value memories but notes that these studies rely on contextualized residual activations, making the query space indirect and hard to interpret. The authors highlight a key uncertainty: whether FFN computation in dense LLMs can be decoupled from intermediate representations while still retaining trainability. MemoryLLM is proposed to address this by training FFNs directly on context-free token embeddings, aiming for deterministic interpretability and reduced dependence on the residual stream.
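The key–value memory reading mentioned above treats the FFN's first weight matrix as a bank of "keys" matched against the incoming (contextualized residual) activation, and the second matrix as the corresponding "values" summed according to key activations. A minimal NumPy sketch of that reading, with all shapes and names illustrative rather than taken from the paper:

```python
import numpy as np

def ffn_as_kv_memory(x, W_in, W_out):
    """View a transformer FFN layer as a key-value memory.

    x     : (d_model,)       input activation (e.g. a residual-stream vector)
    W_in  : (d_ff, d_model)  each row is a "key" pattern matched against x
    W_out : (d_ff, d_model)  each row is the "value" emitted when its key fires
    """
    # Memory coefficients: how strongly each key matches the query x.
    coeffs = np.maximum(x @ W_in.T, 0.0)  # ReLU(x . k_i), shape (d_ff,)
    # Output: coefficient-weighted sum of the value vectors.
    return coeffs @ W_out                 # sum_i coeffs_i * v_i

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32
x = rng.standard_normal(d_model)
W_in = rng.standard_normal((d_ff, d_model))
W_out = rng.standard_normal((d_ff, d_model))

y = ffn_as_kv_memory(x, W_in, W_out)
# Algebraically identical to the standard FFN forward pass ReLU(W_in x) W_out:
assert np.allclose(y, np.maximum(W_in @ x, 0.0) @ W_out)
```

The sketch makes the paper's criticism concrete: because `x` is a contextualized residual activation rather than a fixed, context-free embedding, the "query" each key is matched against varies with context, which is what makes the query space indirect and hard to interpret.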

References

It remains unclear whether FFN computation within dense LLMs can be disentangled from any intermediate activations without significantly hurting model trainability.

MemoryLLM: Plug-n-Play Interpretable Feed-Forward Memory for Transformers (2602.00398 - Jaiswal et al., 30 Jan 2026) in Appendix, Background Work — Understanding Feed-Forward Networks in Transformers