Functional role of MoDA’s altered attention patterns
Investigate the functional role of the attention patterns that emerge under mixture-of-depths attention (MoDA) and that differ from typical attention-sink behavior. Specifically, characterize how attention probability mass is distributed between sequence key–value blocks and depth key–value blocks, and determine what this redistribution implies for long-context modeling.
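One way to make the "distribution of probability mass" concrete is to measure, for a single query, how much attention lands on sequence keys versus depth keys. The sketch below is illustrative only. It assumes that the query's logits over sequence keys and depth keys are normalized by one joint softmax, and that the first sequence position plays the role of the attention sink. The function name `attention_mass_split` and its arguments are hypothetical and do not come from the paper.

```python
import math


def attention_mass_split(seq_logits, depth_logits):
    """Split one query's attention mass into sequence, depth, and sink shares.

    Assumption (not confirmed by the source): sequence and depth logits are
    concatenated and normalized with a single softmax. `seq_logits` must be
    non-empty; its first entry is treated as the sink position.
    """
    logits = list(seq_logits) + list(depth_logits)
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    n_seq = len(seq_logits)
    seq_mass = sum(probs[:n_seq])    # mass on sequence key-value entries
    depth_mass = sum(probs[n_seq:])  # mass on depth key-value entries
    sink_mass = probs[0]             # mass on the first sequence token
    return seq_mass, depth_mass, sink_mass


# Three sequence keys and one depth key with identical logits:
seq_mass, depth_mass, sink_mass = attention_mass_split([0.0, 0.0, 0.0], [0.0])
print(seq_mass, depth_mass, sink_mass)  # 0.75 0.25 0.25
```

Averaging these three quantities over queries, heads, and layers, and comparing them with the sink mass of a baseline without depth keys, would give a first quantitative picture of the redistribution this problem asks about.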
References
While these patterns are intriguing, their precise functional role remains unclear and warrants further investigation.
— Mixture-of-Depths Attention
(2603.15619 - Zhu et al., 16 Mar 2026) in Section 5, Analyzing MoDA with Attention Visualization