Adaptive Attention Sink in SSMs

Updated 21 January 2026
  • Adaptive Attention Sink (AAS) is a mechanism that stabilizes Structured State Space Models (SSMs) by re-injecting learnable anchor states during long-sequence processing.
  • It integrates grouped finite impulse response filtering with a controlled sink term to maintain consistent receptive fields and prevent state drift.
  • Empirical results show that AAS improves training stability, accelerates convergence, and reduces perplexity on language modeling benchmarks.

Adaptive Attention Sink (AAS) is a mechanism designed to enhance the stability and performance of Structured State Space Models (SSMs) during long-sequence modeling. It is inspired by the "attention sink" phenomenon observed in streaming or windowed Transformer architectures, where the first token(s) in a fixed context window disproportionately attract attention, anchoring the model’s focus and preserving context over extended sequences. By incorporating a controlled sink term that re-injects information from anchor states, AAS enables SSMs to maintain numerical stability and consistent receptive fields, thereby mitigating issues of state drift, vanishing, or explosion over thousands of time steps (Meng et al., 2024).

1. Motivation and Conceptual Basis

In streaming Transformer models, when only a fixed window of Key–Value (KV) pairs is cached, the initial token(s) within that window often become attention attractors, or "sinks." This effect prevents the network's context from drifting during long deployments. SSMs such as Mamba-2, which rely on recurrent matrix multiplications, are vulnerable to numerical instability: the state-update matrix can cause the recurrent hidden state to decay or explode as the sequence length increases, limiting the practical receptive field and undermining training stability.

AAS is introduced in SSMs to replicate the stabilizing influence of these attractors. By periodically re-injecting the memory of anchor states—learnable representations of the stream’s boundary conditions—the mechanism anchors the model's latent dynamics, thus preventing the recurrence from "forgetting" the start of each chunk and ensuring stability without aggressive interventions such as gradient clipping (Meng et al., 2024).

2. Mathematical Formulation

The base recurrence of SSMs follows the form:

$$h_t = A_t h_{t-1} + B_t x_t$$

$$y_t = C_t^\top h_t$$
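As a point of reference, the base recurrence can be sketched in a few lines of NumPy. The dimensions and matrix values below are illustrative, not taken from the paper:

```python
import numpy as np

def ssm_step(A_t, B_t, C_t, h_prev, x_t):
    """One step of the base SSM recurrence:
    h_t = A_t h_{t-1} + B_t x_t,  y_t = C_t^T h_t."""
    h_t = A_t @ h_prev + B_t @ x_t
    y_t = C_t.T @ h_t
    return h_t, y_t

# Illustrative dimensions: state size d, input size m (assumptions)
d, m = 4, 3
rng = np.random.default_rng(0)
h = np.zeros(d)
A = 0.9 * np.eye(d)                 # contractive state-update matrix
B = rng.normal(size=(d, m))
C = rng.normal(size=(d, 1))
h, y = ssm_step(A, B, C, h, rng.normal(size=m))
```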

With the Adaptive Attention Sink, the grouped-state recurrence is modified as follows. For a decomposition into $Q$ groups:

$$h_t^i = A_t h_{t-1}^i + s_t + S_t, \quad i = (t \bmod Q)$$

$$h_t^j = h_{t-1}^j, \quad j \neq (t \bmod Q)$$

$$y_t = C_t^\top \sum_{i=0}^{Q-1} h_t^i$$

Here, $s_t \in \mathbb{R}^d$ is the grouped FIR-filtered input, and $S_t \in \mathbb{R}^d$ is the attention sink term.

The sink term $S_t$ is defined as $S_t = P \cdot H_{t-Q}$, where $H_{t-Q} = [h_{t-Q}^0;\ h_{t-Q}^1;\ \ldots;\ h_{t-Q}^{Q-1}] \in \mathbb{R}^{Qd}$ is the concatenation of anchor states from $Q$ time steps prior, and $P \in \mathbb{R}^{d \times Qd}$ is a learnable sink-projection matrix.

Initialization randomizes $P$ (e.g., Xavier uniform at scale $1/\sqrt{d}$), and the earliest $Q$ states $h_{-Q}, \dots, h_{-1}$ are replaced by learnable prompt vectors $\{p^0, \dots, p^{Q-1}\}$, which are optimized jointly with the other model parameters.
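A minimal sketch of the sink term and its initialization, assuming a simple uniform init at scale $1/\sqrt{d}$ and prompt vectors standing in for the earliest $Q$ states (the exact initialization scheme is an assumption):

```python
import numpy as np

def sink_term(P, anchors):
    """S_t = P @ concat(h_{t-Q}^0, ..., h_{t-Q}^{Q-1}).
    `anchors` holds the Q group states from Q steps earlier."""
    H_old = np.concatenate(anchors)   # shape (Q*d,)
    return P @ H_old                  # shape (d,)

d, Q = 4, 2                           # illustrative sizes
rng = np.random.default_rng(1)
# Uniform init at scale 1/sqrt(d), per the text
P = rng.uniform(-1, 1, size=(d, Q * d)) / np.sqrt(d)
# Learnable prompt vectors p^0..p^{Q-1} replace the earliest Q states
prompts = [rng.normal(size=d) for _ in range(Q)]
S = sink_term(P, prompts)
```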

3. Integration with Grouped Finite Impulse Response (FIR) Filtering

The grouped FIR-filtered input is computed with learnable coefficients $\{k_0, \dots, k_{n-1}\}$:

$$s_t = \sum_{j=0}^{n-1} k_j \, B_{t-j} x_{t-j}$$
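The FIR sum can be sketched directly from the formula; the history buffers and sizes here are illustrative assumptions:

```python
import numpy as np

def fir_input(k, B_hist, x_hist):
    """s_t = sum_{j=0}^{n-1} k_j * B_{t-j} @ x_{t-j}.
    k: (n,) FIR coefficients; B_hist[j], x_hist[j] hold values from step t-j."""
    return sum(k[j] * (B_hist[j] @ x_hist[j]) for j in range(len(k)))

n, d, m = 3, 4, 2                     # illustrative filter order and sizes
rng = np.random.default_rng(2)
k = rng.normal(size=n)
B_hist = [rng.normal(size=(d, m)) for _ in range(n)]
x_hist = [rng.normal(size=m) for _ in range(n)]
s = fir_input(k, B_hist, x_hist)
```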

Within each time step, the update procedure combines FIR filtering, sink injection, and grouped state updates, as summarized in the following pseudocode:

for t = 1 … T do
    # Compute FIR-filtered input
    s_t ← 0
    for j in 0 … n−1 do
        s_t += k_j * B_{t−j} * x_{t−j}
    endfor
    # Retrieve anchor states from Q steps earlier
    H_old ← [h_{t−Q}^0; …; h_{t−Q}^{Q−1}]
    # Compute sink injection
    S_t ← P * H_old
    # Update each group
    for i in 0 … Q−1 do
        if i == (t mod Q) then
            h_t^i ← A_t * h_{t−1}^i + s_t + S_t
        else
            h_t^i ← h_{t−1}^i
        endif
    endfor
    # Compute output
    y_t ← C_t^T * sum_{i=0}^{Q−1}(h_t^i)
endfor

This sequence ensures that long-term memory is periodically re-injected from the anchors, maintaining numerical and contextual integrity over long horizons (Meng et al., 2024).
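The step above transcribes directly into NumPy. The dimensions, buffer handling, and initialization below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def aas_step(t, A_t, C_t, s_t, P, h_groups, anchor_buffer):
    """One AAS update: inject the sink into group (t mod Q),
    leave the other groups unchanged, and read out over all groups.
    h_groups: Q current group states; anchor_buffer: states from Q steps earlier."""
    Q = len(h_groups)
    S_t = P @ np.concatenate(anchor_buffer)           # sink injection
    i = t % Q
    new_groups = [g.copy() for g in h_groups]
    new_groups[i] = A_t @ h_groups[i] + s_t + S_t     # only the active group updates
    y_t = C_t.T @ sum(new_groups)                     # output over all groups
    return new_groups, y_t

d, Q = 4, 2                                           # illustrative sizes
rng = np.random.default_rng(3)
A = 0.9 * np.eye(d)
C = rng.normal(size=(d, 1))
P = rng.normal(size=(d, Q * d)) / np.sqrt(d)
groups = [np.zeros(d) for _ in range(Q)]
anchors = [rng.normal(size=d) for _ in range(Q)]      # prompt vectors for t < Q
groups, y = aas_step(1, A, C, rng.normal(size=d), P, groups, anchors)
```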

4. Computational Complexity and Implementation Strategies

The per-step time complexity for GFSSM with the attention sink is:

  • FIR computation: $O(n \cdot d \cdot m)$ if $B_t$ is $d \times m$, with practical cost reduced to $O(n \cdot d)$ via precomputation.
  • Sink projection: $O(Q \cdot d^2)$ for a naive implementation; reduced to $O(r \cdot d)$ using the low-rank factorization $P = U V^\top$, where $U \in \mathbb{R}^{d \times r}$, $V \in \mathbb{R}^{Qd \times r}$, and $r \ll d$.
  • State updates and summations: $O(d \cdot Q)$.

Overall, with the low-rank optimization, the per-step cost remains $O(d \cdot (n + Q) + r \cdot d)$, linear in the state dimension and filter order.
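The low-rank factorization can be checked numerically: computing $U (V^\top H)$ as two thin matrix–vector products gives the same result as materializing $P = U V^\top$ and multiplying once. The shapes below are illustrative (note $V$ must be $Qd \times r$ for $P$ to be $d \times Qd$):

```python
import numpy as np

d, Q, r = 64, 4, 8                    # illustrative sizes, r << d
rng = np.random.default_rng(4)
U = rng.normal(size=(d, r))           # left factor
V = rng.normal(size=(Q * d, r))       # right factor
H = rng.normal(size=Q * d)            # concatenated anchor states

full = (U @ V.T) @ H                  # naive: materializes the d x Qd matrix P
lowrank = U @ (V.T @ H)               # factored: two thin matvecs, P never formed
```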

Space complexity involves:

  • Storing the last $Q$ grouped states: $O(Q \cdot d)$.
  • Low-rank factors of $P$: $O(r \cdot d)$.
  • Semiseparable matrices for FIR stages: $O(n \cdot d)$.

Exploitation of semiseparable structure permits all major multiplications to be implemented via streaming scans over low-rank generators, further improving efficiency for long-sequence processing (Meng et al., 2024).

5. Empirical Impact and Preliminary Validation

Initial experiments with GFSSM incorporating the attention sink have demonstrated substantial benefits:

  • Training stability: Models lacking sink injection exhibited vanishing or exploding hidden states beyond 8,000 time steps, necessitating aggressive gradient clipping. Incorporation of the attention sink enabled stable training without gradient clipping for streams up to 16,000 tokens.
  • Convergence rate: On WikiText-103, GFSSM plus sink converged in approximately 30,000 gradient updates to a validation perplexity of 21.5, compared to 40,000 updates for GFSSM without sink.
  • Final perplexity: The sink term achieved a 5–8% reduction in perplexity at convergence across several text-modeling benchmarks.

These findings indicate that AAS effectively anchors SSM dynamics, prevents destabilization during long sequence modeling, accelerates convergence, and confers measurable improvements in language modeling quality. A plausible implication is that further tuning and architectural refinements could yield additional efficiency gains and modeling capabilities in future work (Meng et al., 2024).

6. Relation to Transformer Architectures and Broader Significance

By adapting the attention sink phenomenon—previously a feature of windowed Transformers—to SSMs, AAS bridges architectural paradigms between linear-state models and self-attention mechanisms. This integration supports scalable, high-performing sequence modeling with efficient computation and robust long-term context, narrowing performance gaps and further enabling SSMs as alternatives to traditional attention-based models for language and sequential data processing (Meng et al., 2024).

The conceptual unification of boundary-condition anchoring, semiseparable structure, and grouped FIR filtering within a single efficient framework marks AAS as a critical innovation in state-space modeling for large-scale deployments.
