Mechanism Driving Architecture-Specific Alpha Scaling
Determine which component of the recurrent-state update causes the optimal initial-state scaling parameter alpha to differ between GatedDeltaNet and Mamba-2. Specifically, establish whether the scalar decay gate or the key-dependent erasure term in GatedDeltaNet (as opposed to the SSD gating in Mamba-2) accounts for the approximately tenfold difference in the optimal alpha for effective S0 tuning.
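To make the two candidate mechanisms concrete, the following is a minimal NumPy sketch of one recurrent step for each architecture, written from the commonly stated forms of the updates rather than from the paper's code. The function names, the `(d_v, d_k)` state layout, the symbols `g` (scalar decay) and `beta` (write strength), and the `use_decay` / `use_erasure` flags are all illustrative assumptions; the flags show one hypothetical way to switch each component off independently, not an ablation the source reports.

```python
import numpy as np

def mamba2_step(S, k, v, g):
    """SSD-style step (assumed form): scalar decay g, then a rank-1 write.

    S: state of shape (d_v, d_k); k: key (d_k,); v: value (d_v,).
    """
    return g * S + np.outer(v, k)

def gated_delta_step(S, k, v, g, beta, use_decay=True, use_erasure=True):
    """Gated delta-rule step (assumed form):

        S_t = g * S_{t-1} (I - beta k k^T) + beta v k^T

    The flags are hypothetical ablation switches: disabling one leaves
    only the other mechanism acting on the carried-over state.
    """
    d_k = k.shape[0]
    decay = g if use_decay else 1.0
    # Key-dependent erasure removes (part of) the state's component along k.
    erase = np.eye(d_k) - beta * np.outer(k, k) if use_erasure else np.eye(d_k)
    return decay * (S @ erase) + beta * np.outer(v, k)
```

Because both updates are linear in the state, the contribution of a scaled initial state `alpha * S0` can be tracked on its own by running the steps with `v` set to zero. Under these assumptions, comparing how quickly that contribution shrinks with each flag disabled is one possible way to probe which term interacts with alpha.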
References
GatedDeltaNet combines scalar decay with key-dependent erasure, while Mamba-2 uses SSD gating; our experiments do not isolate which part of the update drives the alpha shift, only that transferring the Qwen setting unchanged leaves a large amount of Falcon performance on the table.
— S0 Tuning: Zero-Overhead Adaptation of Hybrid Recurrent-Attention Models
(2604.01168 - Young, 1 Apr 2026) in Appendix: Alpha Sweep