
Dynamic Gated Alternation (AMOR)

Updated 28 February 2026
  • Dynamic Gated Alternation (AMOR) is a dual-process sequence model that integrates an efficient SSM backbone with sparse attention invoked only when predictive entropy is high.
  • It employs an entropy-based gating mechanism to selectively route tokens, achieving up to a 78% reduction in attention computation compared to standard transformers.
  • Empirical evaluations show that AMOR attains perfect retrieval accuracy on a synthetic retrieval task while expending minimal additional computation on long-range dependencies.

Dynamic Gated Alternation (AMOR), also termed the Adaptive Metacognitive Output Router, is a dual-process sequence modeling architecture that dynamically allocates computation by integrating an efficient State Space Model (SSM) backbone with a sparse attention mechanism. The selection between these computational paths is governed by an entropy-based gating signal, yielding a metacognitive system that routes tokens to more expensive processing only when model uncertainty is high. AMOR is designed to address the inefficiencies of standard transformers, which apply uniform computation at all positions, and mitigate the retrieval limitations of SSMs over long-range dependencies by leveraging information-theoretic criteria for adaptive attention allocation (Zheng, 22 Jan 2026).

1. Architectural Foundations

AMOR operationalizes the principle of dual-process cognition by partitioning sequence processing into two interacting systems:

  1. System 1 (SSM Backbone): Processes each token sequentially through an SSM (such as S4 or a GRU), producing hidden states $h_t \in \mathbb{R}^d$ and logits $\ell_t = W_o h_t \in \mathbb{R}^{|V|}$ at each time step $t$. This backbone operates in $O(n)$ time with respect to sequence length $n$, yielding stateful representations with low per-token computational cost.
  2. System 2 (Sparse Attention): Invoked conditionally, only when System 1 expresses high predictive uncertainty as measured by entropy. Rather than recomputing transformer-style key and value projections, AMOR employs "Ghost KV": projecting $K = W_K H$ and $V = W_V H$ directly from the SSM's entire sequence of hidden states $H = [h_1; \ldots; h_T] \in \mathbb{R}^{T \times d}$ in $O(n)$ time. At selected positions, the attention mechanism forms queries $q_t = W_Q h_t$ and computes sparse top-$k$ attention over the key-value pairs.

This hybrid architecture thus amortizes expensive $O(n^2)$ attention across only a subset of positions, retaining SSM efficiency while enabling high-precision retrieval. A minimal sketch of the Ghost KV construction is given below.
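To make the Ghost KV construction concrete, here is a minimal PyTorch sketch. The module name `GhostKV` and the dimension handling are illustrative assumptions; only the projections $K = W_K H$, $V = W_V H$, and $q_t = W_Q h_t$ come from the description above.

```python
import torch
import torch.nn as nn

class GhostKV(nn.Module):
    """Illustrative Ghost KV: keys/values projected once from SSM hidden states."""
    def __init__(self, d_model: int):
        super().__init__()
        self.W_K = nn.Linear(d_model, d_model, bias=False)
        self.W_V = nn.Linear(d_model, d_model, bias=False)
        self.W_Q = nn.Linear(d_model, d_model, bias=False)

    def project(self, H: torch.Tensor):
        # H: (T, d) stacked SSM states [h_1; ...; h_T]; a single O(n) pass,
        # with no transformer-style recomputation of K/V from raw embeddings.
        return self.W_K(H), self.W_V(H)

    def query(self, h_t: torch.Tensor):
        # q_t = W_Q h_t, formed only at gate-selected positions
        return self.W_Q(h_t)
```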

2. Entropy-Based Gating Mechanism

Token-level routing to System 2 is determined by an entropy gate derived from the SSM's predictive uncertainty:

  • Predictive Distribution: After SSM processing, the output logits at position $t$ define probabilities

$$p_{t,i} = \frac{\exp(\ell_{t,i})}{\sum_j \exp(\ell_{t,j})}$$

with Shannon entropy

$$H_t = -\sum_{i=1}^{|V|} p_{t,i} \log p_{t,i}$$

The normalized entropy $\hat{H}_t = H_t / \log |V|$ allows consistent thresholding across different vocabularies.

  • Gating Rule: A learnable threshold $\tau$ and sharpness parameter $\alpha$ govern the hard gate

$$g_t = \mathbf{1}\bigl[\sigma\bigl(\alpha(\hat{H}_t - \tau)\bigr) > 0.5\bigr]$$

where $\sigma$ is the sigmoid function. Alternatively, a simple threshold may be used:

$$g_t = \begin{cases} 1, & H_t > \tau \\ 0, & H_t \leq \tau \end{cases}$$

When $g_t = 1$, System 2 attention is applied; otherwise, the model relies solely on the SSM output.
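The gate itself reduces to a few lines. The PyTorch sketch below follows the formulas above; `tau` and `alpha` are given placeholder values, since the learned settings are not specified here.

```python
import torch

def entropy_gate(logits: torch.Tensor, tau: float = 0.5, alpha: float = 10.0) -> torch.Tensor:
    """Hard entropy gate g_t computed from per-position SSM logits.

    logits: (T, |V|) tensor; returns a (T,) tensor of 0/1 routing decisions.
    """
    p = torch.softmax(logits, dim=-1)
    # Shannon entropy, normalized by log|V| so tau transfers across vocabularies
    H = -(p * torch.log(p.clamp_min(1e-12))).sum(dim=-1)
    H_hat = H / torch.log(torch.tensor(float(logits.shape[-1])))
    # Sigmoid-shaped gate; the > 0.5 test is equivalent to H_hat > tau for alpha > 0
    return (torch.sigmoid(alpha * (H_hat - tau)) > 0.5).long()
```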

3. Algorithmic Workflow

The forward computation in AMOR proceeds as follows:

Initialize h_0
for t in 1…T:
    e_t ← Embed(x_t)
    h_t ← SSM(e_t, h_{t-1})                      # System 1
    ℓ_t ← W_o · h_t
    p_t ← softmax(ℓ_t)
    H_t ← -Σ_i p_{t,i} · log p_{t,i}
    Ĥ_t ← H_t / log|V|
    g_t ← 1 if sigmoid(α(Ĥ_t - τ)) > 0.5 else 0  # Entropy Gate

K ← W_K · [h_1; …; h_T]                          # Ghost KV projections, O(n)
V ← W_V · [h_1; …; h_T]
for t in 1…T:
    if g_t == 1:                                 # System 2
        q_t ← W_Q · h_t
        scores ← (q_t · Kᵀ) / √d
        sparse_scores ← top-k(scores, k=3)
        attn_t ← softmax(sparse_scores) · V
    else:
        attn_t ← 0
    h_combined_t ← W_c · [h_t ‖ attn_t]
    output_logits_t ← W_out · h_combined_t

The Ghost KV projections efficiently share SSM-derived temporal context, and sparse top-$k$ attention (typically $k = 3$) maximizes computational savings without degrading retrieval accuracy. A runnable sketch of this workflow appears below.
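For concreteness, here is a minimal, self-contained PyTorch rendering of this workflow. It is a sketch under stated assumptions: a GRU stands in for the SSM backbone, attention is non-causal exactly as in the two-pass pseudocode, and the class name `AMORBlock` and all hyperparameter values are illustrative rather than taken from a reference implementation.

```python
import torch
import torch.nn as nn

class AMORBlock(nn.Module):
    """Illustrative AMOR forward pass: GRU backbone + entropy-gated sparse attention."""
    def __init__(self, vocab_size: int, d: int,
                 tau: float = 0.5, alpha: float = 10.0, k: int = 3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d)
        self.ssm = nn.GRU(d, d, batch_first=True)        # stand-in for the SSM backbone
        self.W_o = nn.Linear(d, vocab_size, bias=False)  # System 1 output head
        self.W_K = nn.Linear(d, d, bias=False)           # Ghost KV projections
        self.W_V = nn.Linear(d, d, bias=False)
        self.W_Q = nn.Linear(d, d, bias=False)
        self.W_c = nn.Linear(2 * d, d, bias=False)       # combines System 1 and 2
        self.W_out = nn.Linear(d, vocab_size, bias=False)
        self.tau, self.alpha, self.k = tau, alpha, k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (T,) token ids; a single unbatched sequence for clarity
        H, _ = self.ssm(self.embed(x).unsqueeze(0))      # System 1: (1, T, d)
        H = H.squeeze(0)                                 # (T, d)

        # Entropy gate over System 1 logits
        logits = self.W_o(H)                             # (T, |V|)
        p = torch.softmax(logits, dim=-1)
        ent = -(p * torch.log(p.clamp_min(1e-12))).sum(-1)
        ent_hat = ent / torch.log(torch.tensor(float(logits.shape[-1])))
        g = torch.sigmoid(self.alpha * (ent_hat - self.tau)) > 0.5  # (T,) bool

        # Ghost KV: one O(n) projection of all SSM states
        K, V = self.W_K(H), self.W_V(H)

        attn = torch.zeros_like(H)
        if g.any():                                      # System 2 at gated positions only
            q = self.W_Q(H[g])                           # (n_gated, d)
            scores = q @ K.t() / H.shape[-1] ** 0.5      # (n_gated, T)
            topv, topi = scores.topk(self.k, dim=-1)     # sparse top-k attention
            w = torch.softmax(topv, dim=-1)              # (n_gated, k)
            attn[g] = (w.unsqueeze(-1) * V[topi]).sum(1)

        h_comb = self.W_c(torch.cat([H, attn], dim=-1))  # h_combined_t
        return self.W_out(h_comb)                        # (T, |V|) output logits
```

Calling `AMORBlock(vocab_size=64, d=32)(torch.randint(0, 64, (128,)))` produces per-position output logits; in practice $\tau$ and $\alpha$ would be learned jointly with the rest of the model.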

4. Computational Complexity and Efficiency

AMOR achieves substantial computational improvements relative to standard transformers:

  • SSM Backbone: $O(n\,d^2)$, essentially linear in sequence length.
  • Transformer Attention: $O(n^2\,d)$ per layer.
  • AMOR: Reuses the $O(n\,d^2)$ SSM hidden states for Ghost KV construction and computes attention only at gate-selected positions. If the gate activates at rate $r$, the total cost is

$$O(n\,d^2 + r\,n^2\,d)$$

Empirical gate rates ($r \approx 0.22$ on synthetic tasks) yield approximately a 78% reduction in attention computation relative to transformer baselines.

| Model | Retrieval Acc (%) | Gate Rate (%) | Attention Reduction (%) |
|---|---|---|---|
| SSM only | 68.4 | 0 | 100 |
| Transformer | 87.3 | 100 | 0 |
| AMOR (entropy) | 100 | 22.3 | 78 |

Perfect retrieval on the simple retrieval task is thus attained while routing only a minority of positions to attention; a worked cost comparison is sketched below.
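As a quick sanity check on these figures, the arithmetic below plugs illustrative values of $n$ and $d$ (hypothetical, not from the paper) together with the reported gate rate into the cost expression above:

```python
# Back-of-the-envelope cost comparison; n and d are illustrative values.
n, d, r = 1024, 512, 0.223          # sequence length, hidden size, gate rate

ssm_cost  = n * d**2                # System 1 backbone + Ghost KV projections, O(n d^2)
full_attn = n**2 * d                # dense transformer attention, O(n^2 d)
amor_attn = r * n**2 * d            # attention only at gate-selected positions

print(f"attention reduction: {1 - amor_attn / full_attn:.1%}")  # ~77.7%, i.e. the 78% figure
```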

5. Empirical Evaluation and Benchmarks

AMOR is validated on synthetic long-range retrieval benchmarks:

  • Simple Retrieval Task: Sequences of length 128 combine routine bigram dependencies with intermittent explicit retrieval points (~3% of positions). Metrics include overall next-token and retrieval accuracy, gate rate, and the entropy gap between retrieval and local positions. AMOR yields 100% retrieval accuracy with only 22.3% gate activation and an entropy gap of 1.09 nats, indicating the effectiveness of entropy as a retrieval signal.
  • Needle-in-Haystack Task: Insertion of 2–5 key–value pairs followed by 50–150 noise tokens induces SSM state decay; retrieval is then probed at distal positions. With oracle gating, AMOR reaches 37.1% retrieval accuracy (vs. 4.4% for transformers); entropy-based gating achieves 9.9% accuracy (gate rate 81%), illustrating both the limits of SSM state retention and the value of attention augmentation.

Ablation studies demonstrate that Ghost KV projections from SSM states (36.3% retrieval) vastly outperform raw embedding-based attention (6.1%), and that sparser attention ($k = 3$) enhances retrieval focus compared to denser alternatives.

6. Interpretability and Theoretical Insights

The entropy gating mechanism in AMOR offers principled, interpretable adaptive computation:

  • Metacognitive Routing: Entropy, grounded in information theory, quantifies the SSM's uncertainty and serves as a reliable signal for invoking auxiliary processing. Positions requiring retrieval consistently show an entropy gap of 1.09 nats, nearly half the normalized entropy range, distinguishing them from routine, low-uncertainty predictions.
  • Interpretable Computation: The gating process is transparent and admits analytic understanding. When false positives occur (positions erroneously routed to attention), they introduce additional compute without adversely impacting correctness.
  • Systemic View: AMOR instantiates dual-process computation in which an $O(n)$ backbone handles standard context and a lightweight, information-theoretically justified router triggers $O(n^2)$ attention only on demand. This design confers both efficiency gains and rigorously justified decision boundaries for adaptive allocation (Zheng, 22 Jan 2026).