Mechanism Driving Architecture-Specific Alpha Scaling

Determine which component of the recurrent-state update accounts for the architecture-specific difference in the optimal initial-state scaling parameter α between GatedDeltaNet and Mamba-2. Specifically, establish whether the scalar decay gate or the key-dependent erasure term in GatedDeltaNet (as opposed to SSD gating in Mamba-2) drives the roughly tenfold gap in the optimal α needed for effective S0 tuning.

Background

The study finds that the optimal scaling factor for injecting the learned initial state differs substantially across architectures: α≈0.07 for GatedDeltaNet in Qwen3.5-4B versus α≈0.60–0.70 for Mamba-2 in FalconH1-7B.

While the authors hypothesize that differences in gating dynamics across recurrence families underlie this disparity, they explicitly state that their experiments did not isolate which part of the update is causally responsible for the alpha shift.
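To make the two candidate mechanisms concrete, the following is a minimal toy sketch, not the authors' experimental setup. It assumes the commonly published form of the gated delta rule, S_t = g_t · S_{t-1}(I − β_t k_t k_tᵀ) + β_t v_t k_tᵀ, and treats the scalar-decay-only case as a stand-in for Mamba-2-style SSD gating. Because both recurrences are linear in the state, the part of S_t that depends on the initial state can be propagated separately from the input-driven part. The function name `initial_state_carryover`, the flags, and all numeric ranges are hypothetical and chosen only for illustration.

```python
import numpy as np


def initial_state_carryover(s0, keys, decay, beta,
                            use_decay=True, use_erasure=True):
    """Propagate only the S0-dependent part of a linear recurrence.

    Assumed shapes: s0 (d_v, d_k), keys (T, d_k) with unit-norm rows,
    decay (T,) in (0, 1], beta (T,) in [0, 1].
    The input-driven term (beta * v k^T) is omitted on purpose, since it
    does not depend on the initial state.
    """
    state = s0.copy()
    for k, g, b in zip(keys, decay, beta):
        if use_erasure:
            # Key-dependent erasure: S (I - b k k^T) = S - b (S k) k^T
            state = state - b * np.outer(state @ k, k)
        if use_decay:
            # Scalar decay gate applied to the whole state
            state = g * state
    return state


# Synthetic inputs (assumed ranges, not taken from any trained model).
rng = np.random.default_rng(0)
T, d_v, d_k = 64, 8, 8
s0 = rng.standard_normal((d_v, d_k))
keys = rng.standard_normal((T, d_k))
keys /= np.linalg.norm(keys, axis=1, keepdims=True)
decay = rng.uniform(0.9, 1.0, size=T)
beta = rng.uniform(0.0, 1.0, size=T)

variants = {
    "decay only": dict(use_decay=True, use_erasure=False),
    "erasure only": dict(use_decay=False, use_erasure=True),
    "decay + erasure": dict(use_decay=True, use_erasure=True),
}

# Fraction of the initial state's Frobenius norm that survives T steps.
retained = {
    name: np.linalg.norm(
        initial_state_carryover(s0, keys, decay, beta, **flags)
    ) / np.linalg.norm(s0)
    for name, flags in variants.items()
}

for name, frac in retained.items():
    print(f"{name:>16}: retained norm fraction = {frac:.4g}")
```

The sketch only measures how long an injected initial state persists under each term; it says nothing about which α is optimal. Isolating the causal component in practice would presumably require α sweeps on trained models with each term ablated or matched, which is the open question stated above.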

References

GatedDeltaNet combines scalar decay with key-dependent erasure, while Mamba-2 uses SSD gating; our experiments do not isolate which part of the update drives the alpha shift, only that transferring the Qwen setting unchanged leaves a large amount of Falcon performance on the table.

S0 Tuning: Zero-Overhead Adaptation of Hybrid Recurrent-Attention Models (2604.01168 - Young, 1 Apr 2026) in Appendix: Alpha Sweep