Explain the shift in optimal SSM fraction with increasing sequentiality of hybrid block arrangement
Ascertain the mechanism that causes the optimal fraction of State Space Model (SSM) channels to decrease as the hybrid mixer block arrangement becomes more sequential in the Falcon-H1 architecture. Specifically, under fixed attention allocation α_A = 1/8 and varying SSM/MLP channels with α_S + α_M = 7/8, determine why the optimal SSM fraction α_S shifts from 3/8 (fully parallel SAM) to 2/8 (semi-parallel SA_M) to 1/8 (fully sequential S_A_M), and provide a principled explanation for the associated performance differences observed across these block configurations.
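To make the constraint concrete, the sketch below enumerates the SSM/MLP splits that satisfy α_S + α_M = 7/8 at fixed α_A = 1/8 and prints the MLP fraction implied by each reported optimum. The 1/8 grid, the requirement that both fractions be positive, and the names `feasible_splits`, `implied_mlp_fraction`, and `REPORTED_OPTIMA` are assumptions made for illustration; the statement itself only gives the three optimal values. It does not model or explain the shift, it only restates the arithmetic of the setup.

```python
from fractions import Fraction

# Fixed attention allocation and the remaining budget shared by SSM and MLP,
# both taken from the problem statement.
ALPHA_A = Fraction(1, 8)
BUDGET = Fraction(7, 8)  # alpha_S + alpha_M


def feasible_splits(steps=8):
    """Enumerate (alpha_S, alpha_M) pairs on a grid of 1/steps.

    Assumption: the sweep uses a uniform grid and keeps both fractions
    strictly positive. The source does not state the exact grid.
    """
    unit = Fraction(1, steps)
    splits = []
    k = 1
    while k * unit < BUDGET:
        alpha_s = k * unit
        splits.append((alpha_s, BUDGET - alpha_s))
        k += 1
    return splits


def implied_mlp_fraction(alpha_s):
    """MLP fraction forced by the constraint alpha_S + alpha_M = 7/8."""
    return BUDGET - alpha_s


# Optimal SSM fractions reported for each block arrangement.
REPORTED_OPTIMA = {
    "SAM": Fraction(3, 8),    # fully parallel
    "SA_M": Fraction(2, 8),   # semi-parallel
    "S_A_M": Fraction(1, 8),  # fully sequential
}

for config, alpha_s in REPORTED_OPTIMA.items():
    alpha_m = implied_mlp_fraction(alpha_s)
    # Multiply by 8 so the values read in eighths rather than reduced form.
    print(f"{config}: alpha_S = {alpha_s * 8}/8, alpha_M = {alpha_m * 8}/8")
```

Because the budget is fixed, each step toward a more sequential arrangement that lowers the optimal α_S by 1/8 raises the implied α_M by the same amount (4/8, 5/8, 6/8), so any explanation of the shift has to account for both sides of that trade.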
References
Interestingly, as block configuration becomes more sequential SAM→SA_M→S_A_M, the optimal SSM fraction reduces as 3/8→2/8→1/8. At the moment, we don't have an explanation of this behavior.