Functional role of MoDA’s altered attention patterns

Investigate the functional role of the attention patterns observed under mixture-of-depths attention (MoDA), which differ from typical attention sink behavior: characterize how probability mass is distributed between sequence key–value blocks and depth key–value blocks, and determine what this redistribution implies for long-context modeling.
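
As a starting point for this characterization, here is a minimal sketch (not the paper's code) of how the sequence/depth mass split could be measured from post-softmax attention weights. The KV layout, with sequence slots followed by n_depth depth slots on the key axis, and the function name mass_split are assumptions made for illustration.

```python
# Hedged sketch: measure how attention probability mass splits between
# sequence KV slots and depth KV slots. Assumes the key axis is laid
# out as [seq_len sequence slots | n_depth depth slots]; this layout
# is an assumption about MoDA's KV arrangement, not from the paper.
import torch

def mass_split(attn: torch.Tensor, n_depth: int):
    """attn: (batch, heads, queries, seq_len + n_depth) softmax weights.
    Returns per-head mean mass on sequence vs. depth KV slots."""
    split = attn.shape[-1] - n_depth
    seq_mass = attn[..., :split].sum(dim=-1)    # (batch, heads, queries)
    depth_mass = attn[..., split:].sum(dim=-1)  # (batch, heads, queries)
    return seq_mass.mean(dim=(0, 2)), depth_mass.mean(dim=(0, 2))

# Toy usage: random logits over 128 sequence slots + 12 depth slots.
logits = torch.randn(1, 8, 128, 128 + 12)
attn = logits.softmax(dim=-1)
seq_m, depth_m = mass_split(attn, n_depth=12)
print(depth_m)  # fraction of mass each head places on depth KV entries
```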

Background

The paper visualizes attention heatmaps for a 700M-parameter model trained with MoDA and observes substantial attention mass on depth key–value entries, especially in middle and late layers. This suggests that MoDA actively retrieves cross-layer information rather than relying solely on sequence-local context.
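
To connect such heatmaps to the layer-wise observation, a simple per-layer summary could average the depth-slot mass over batch, heads, and query positions. The layer count, head count, and depth-slot layout below are illustrative placeholders, not values from the paper.

```python
# Hedged sketch: a per-layer profile of depth-KV attention mass, the
# kind of summary that could reproduce the qualitative finding that
# depth entries attract more mass in middle and late layers. All
# tensor shapes here are illustrative; real attention maps would be
# captured from a model's forward pass.
import torch

n_layers, n_depth, seq_len = 24, 12, 128
attn_by_layer = [
    torch.randn(1, 8, seq_len, seq_len + n_depth).softmax(dim=-1)
    for _ in range(n_layers)
]

# Mean mass on the trailing n_depth (depth) slots, one scalar per layer.
depth_mass_per_layer = torch.stack(
    [a[..., -n_depth:].sum(dim=-1).mean() for a in attn_by_layer])
for i, m in enumerate(depth_mass_per_layer.tolist()):
    print(f"layer {i:2d}: depth-KV mass = {m:.3f}")
```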

The authors note that MoDA’s patterns appear to differ from typical attention sink behavior, with probability mass being more broadly distributed across sequence and depth slots. Despite these observations, the precise functional role of these altered patterns is not established, motivating further study.
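
Two summary statistics that could operationalize this comparison are sketched below: "sink mass", the mean weight on the first key position (the usual sink slot), and a normalized-entropy dispersion score. The entropy measure is our own illustrative choice, not a metric defined in the paper.

```python
# Hedged sketch: statistics for testing whether attention departs from
# sink behavior. Sink mass follows the common definition (mass on key
# position 0); the entropy-based dispersion score is an assumption of
# this sketch, chosen to quantify "broadly distributed" mass.
import torch

def sink_mass(attn: torch.Tensor) -> torch.Tensor:
    """Mean attention mass on key position 0 (the usual sink slot)."""
    return attn[..., 0].mean()

def normalized_entropy(attn: torch.Tensor) -> torch.Tensor:
    """Entropy of the attention distribution, scaled to [0, 1];
    values near 1 indicate broadly distributed probability mass."""
    p = attn.clamp_min(1e-12)
    ent = -(p * p.log()).sum(dim=-1)
    return (ent / torch.log(torch.tensor(float(attn.shape[-1])))).mean()

attn = torch.randn(1, 8, 128, 140).softmax(dim=-1)
print(sink_mass(attn).item(), normalized_entropy(attn).item())
```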

References

"While these patterns are intriguing, their precise functional role remains unclear and warrants further investigation."

Mixture-of-Depths Attention (arXiv:2603.15619, Zhu et al., 16 Mar 2026), Section 5, Analyzing MoDA with Attention Visualization.