Simple per-step gate vs. cumulative forget gate in attention

Determine whether a simple per-step gate $g_i(x_i) = W_G(x_i)$ can replace the cumulative-product forget gate typically used in gated attention mechanisms, $g_i(x_{i:t}) = \prod_{j=i+1}^{t} a_j$ with each $a_j$ produced by a linear layer followed by a sigmoid applied to the hidden state $x_j$, achieving a similar effect and performance while simplifying the implementation of Transformer attention layers.

Background

In the paper’s discussion of gating, the authors contrast how attention layers and feed-forward networks (FFNs) employ gating. Attention gating is commonly interpreted as a forgetting mechanism, implemented as a cumulative product of gates between 0 and 1 across timesteps, typically via sigmoid functions. In contrast, FFNs use local, per-step multiplicative gating (e.g., SwiGLU) without cumulative products.
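To make the asymmetry concrete, the sketch below contrasts the two gating styles in PyTorch. The scalar per-timestep gate, the variable names (`W_a`, `W_gate`, `W_up`), and the shapes are illustrative assumptions, not the paper's implementation.

```python
import torch

torch.manual_seed(0)
d, t = 16, 8                          # hidden size, sequence length
x = torch.randn(t, d)                 # hidden states x_1, ..., x_t

# Attention-style forget gate: per-step sigmoid gates a_j combined as a
# cumulative product, so the contribution of timestep i is decayed by
# every gate produced after it.
W_a = torch.nn.Linear(d, 1)
a = torch.sigmoid(W_a(x)).squeeze(-1)                           # a_j in (0, 1), shape (t,)
suffix = torch.flip(torch.cumprod(torch.flip(a, [0]), 0), [0])  # prod_{j >= i} a_j
g = torch.cat([suffix[1:], torch.ones(1)])                      # g_i = prod_{j=i+1}^{t} a_j

# FFN-style gating (SwiGLU): purely local, no product across timesteps.
W_gate = torch.nn.Linear(d, d)
W_up = torch.nn.Linear(d, d)
ffn_gated = torch.nn.functional.silu(W_gate(x)) * W_up(x)
```

Note that the cumulative product couples each gate $g_i$ to all later timesteps, whereas the SwiGLU gate depends only on the current position.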

Motivated by this asymmetry, the authors conjecture that a simpler, per-step gate for attention, computed directly from the current hidden state via a learned linear map $W_G$, might replicate the functional benefits of the cumulative forget gate while being easier to implement. They explicitly leave verifying this conjecture to future work.
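Under the conjecture, the cumulative product disappears and the gate reduces to a single per-position map. A minimal sketch follows; applying the gate multiplicatively to per-timestep values, and squashing it through a sigmoid to keep it in (0, 1), are our assumptions, since the quoted formula specifies only the linear map $W_G$.

```python
import torch

torch.manual_seed(0)
d, t = 16, 8
x = torch.randn(t, d)                 # hidden states x_1, ..., x_t
v = torch.randn(t, d)                 # quantities to gate, e.g. value vectors

# Conjectured per-step gate: one learned linear map of the current hidden
# state only; no cumulative product across timesteps. The sigmoid is our
# assumption -- the quoted formula specifies only g_i = W_G(x_i).
W_G = torch.nn.Linear(d, 1)
g = torch.sigmoid(W_G(x))             # shape (t, 1), one gate per step
gated_v = g * v                       # broadcasts over the feature dimension
```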

References

We conjecture that defining the gating function simply as $g_i(x_{i:t}) = W_G(x_i)$ may achieve a similar effect while being simpler to implement. We leave the investigation of this hypothesis for future work.

Understanding Transformer from the Perspective of Associative Memory (arXiv:2505.19488), in Section 2.1.2 (Vignette 1: Rethinking Attention and FFN), Gating; after Eq. (31)