Simple per-step gate vs. cumulative forget gate in attention
Determine whether the cumulative-product forget gate typically used in gated attention mechanisms, defined as $g_i(x_{i:t}) = \prod_{j=i+1}^{t} a_j$ with each $a_j$ produced by a linear layer followed by a sigmoid applied to the hidden state $x_j$, can be replaced by a simple per-step gate $g_i(x_i) = W_G(x_i)$ that achieves a similar effect and comparable performance while simplifying implementation in Transformer attention layers.
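To make the comparison concrete, below is a minimal PyTorch sketch of both gating variants, assuming the gate multiplies the causal attention weights and that $W_G$, like $W_a$, is a linear layer followed by a sigmoid; the module names, shapes, and the point of application are illustrative assumptions, not details from the paper.

```python
# Minimal sketch (assumed setup): hidden states x of shape (T, d_model),
# gates applied multiplicatively to a causal (T, T) attention matrix.
import torch
import torch.nn as nn

class CumulativeForgetGate(nn.Module):
    """Cumulative-product gate: g_i(x_{i:t}) = prod_{j=i+1}^{t} a_j,
    with a_j = sigmoid(W_a x_j)."""
    def __init__(self, d_model: int):
        super().__init__()
        self.w_a = nn.Linear(d_model, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Per-step decay a_j in (0, 1), shape (T,)
        a = torch.sigmoid(self.w_a(x)).squeeze(-1)
        # Compute the product in log space: with c[k] = sum_{j<=k} log a_j,
        # the gate for key i at query t is G[t, i] = exp(c[t] - c[i]).
        c = torch.cumsum(torch.log(a), dim=0)
        G = torch.exp(c.unsqueeze(1) - c.unsqueeze(0))  # (T, T)
        # Keep the causal part i <= t; the diagonal is 1 (empty product).
        return torch.tril(G)

class PerStepGate(nn.Module):
    """Conjectured simplification: g_i(x_i) = W_G(x_i), one gate per key
    position, independent of the query step t. (Whether W_G includes a
    sigmoid is not specified in the source; one is assumed here so the
    gate stays bounded in (0, 1).)"""
    def __init__(self, d_model: int):
        super().__init__()
        self.w_g = nn.Linear(d_model, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.w_g(x)).squeeze(-1)      # (T,)
        # Broadcast the same per-key gate across all query rows, causally.
        return g.unsqueeze(0).expand(x.size(0), -1).tril()  # (T, T)

# Hypothetical usage: gate causal attention weights with either variant.
T, d = 8, 16
x = torch.randn(T, d)
mask = torch.triu(torch.ones(T, T, dtype=torch.bool), 1)
attn = torch.softmax(torch.randn(T, T).masked_fill(mask, float("-inf")), dim=-1)
gated_cumulative = attn * CumulativeForgetGate(d)(x)
gated_per_step = attn * PerStepGate(d)(x)
```

Note the structural difference the conjecture exploits: the cumulative gate depends on the whole span $x_{i:t}$ and so must be recomputed (or maintained as a running product) per query step, whereas the per-step gate is a single function of $x_i$ alone and broadcasts across queries.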
References
We conjecture that defining the gating function simply as $g_i(x_{i:t}) = W_G(x_i)$ may achieve a similar effect while being simpler to implement. We leave the investigation of this hypothesis for future work.
— Understanding Transformer from the Perspective of Associative Memory (arXiv:2505.19488), Section 2.1.2 (Vignette 1: Rethinking Attention and FFN), Gating; after Eq. (31)