Explain the shift in optimal SSM fraction with increasing sequentiality of hybrid block arrangement

Ascertain the mechanism that causes the optimal fraction of State Space Model (SSM) channels to decrease as the hybrid mixer block arrangement becomes more sequential in the Falcon-H1 architecture. Specifically, under fixed attention allocation α_A = 1/8 and varying SSM/MLP channels with α_S + α_M = 7/8, determine why the optimal SSM fraction α_S shifts from 3/8 (fully parallel SAM) to 2/8 (semi-parallel SA_M) to 1/8 (fully sequential S_A_M), and provide a principled explanation for the associated performance differences observed across these block configurations.

Background

Within the Falcon-H1 hybrid architecture, the authors study how to allocate channels among attention, SSM, and MLP components and how to arrange these components either in parallel or sequentially. They compare three block configurations: fully parallel (SAM), semi-parallel (SA_M), and fully sequential (S_A_M), while fixing attention channels at α_A = 1/8 and varying SSM/MLP channels so that α_S + α_M = 7/8.
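The three arrangements can be sketched as toy forward passes. This is a minimal illustration, not the paper's implementation: the paper names only the ordering of the mixers (parallel vs. sequential), so the additive residual combination used here is an assumption, and `ssm`, `attn`, `mlp` are hypothetical stand-ins for the actual mixers.

```python
# Minimal sketch of the three Falcon-H1 block arrangements.
# ASSUMPTION: outputs are combined via simple additive residuals;
# ssm/attn/mlp are placeholder callables, not real mixer modules.

def sam_parallel(x, ssm, attn, mlp):
    # Fully parallel SAM: all three mixers read the same input,
    # and their outputs are summed into one residual update.
    return x + ssm(x) + attn(x) + mlp(x)

def sa_m_semi_parallel(x, ssm, attn, mlp):
    # Semi-parallel SA_M: SSM and attention run in parallel,
    # then the MLP is applied sequentially to their combined output.
    h = x + ssm(x) + attn(x)
    return h + mlp(h)

def s_a_m_sequential(x, ssm, attn, mlp):
    # Fully sequential S_A_M: each mixer feeds the next one.
    h = x + ssm(x)
    h = h + attn(h)
    return h + mlp(h)
```

The key structural difference this makes visible: in SAM the SSM sees only the raw block input, while in S_A_M its output is further transformed by both attention and the MLP, which is the axis along which the optimal α_S is observed to shift.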

Their experiments show that the semi-parallel SA_M configuration yields the lowest loss, with optimal allocation (α_S, α_A, α_M) = (2/8, 1/8, 5/8). They further observe that as the configuration becomes more sequential (SAM → SA_M → S_A_M), the optimal SSM fraction α_S consistently decreases from 3/8 to 2/8 to 1/8. The authors explicitly state that they currently lack an explanation for this trend, motivating a clear open problem to understand and justify this behavior.
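The reported trend can be tabulated directly. Note an assumption: the paper explicitly gives the full triple only for SA_M, namely (2/8, 1/8, 5/8); the α_M values for SAM and S_A_M below are derived from the fixed constraint α_A = 1/8 and α_S + α_M = 7/8 together with the stated optimal α_S values.

```python
from fractions import Fraction as F

# Optimal channel allocations (alpha_S, alpha_A, alpha_M) per arrangement.
# alpha_M for SAM and S_A_M is derived from alpha_S + alpha_M = 7/8
# (only the SA_M triple is stated explicitly in the paper).
optimal = {
    "SAM":   (F(3, 8), F(1, 8), F(4, 8)),  # fully parallel
    "SA_M":  (F(2, 8), F(1, 8), F(5, 8)),  # semi-parallel (lowest loss)
    "S_A_M": (F(1, 8), F(1, 8), F(6, 8)),  # fully sequential
}

# Sanity-check the fixed-budget constraints for every arrangement.
for name, (a_s, a_a, a_m) in optimal.items():
    assert a_a == F(1, 8), name
    assert a_s + a_m == F(7, 8), name
```

Read top to bottom, the table shows the open problem in one glance: α_S decreases monotonically (3/8 → 2/8 → 1/8) as the arrangement becomes more sequential, with the freed budget absorbed entirely by the MLP.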

References

Interestingly, as block configuration becomes more sequential SAM→SA_M→S_A_M, the optimal SSM fraction reduces as 3/8→2/8→1/8. At the moment, we don't have an explanation of this behavior.

Falcon-H1: A Family of Hybrid-Head Language Models Redefining Efficiency and Performance (2507.22448 - Zuo et al., 30 Jul 2025) in Section 2 (Architecture) → Channel Allocation → The results paragraph; Figure 2 (channel_ablations)