Dynamic Gated Alternation (AMOR)
- Dynamic Gated Alternation (AMOR) is a dual-process sequence model that integrates an efficient SSM backbone with sparse attention that activates only under high predictive entropy.
- It employs an entropy-based gating mechanism to selectively route tokens, achieving up to a 78% reduction in attention computation compared to standard transformers.
- Empirical evaluations reveal that AMOR attains perfect retrieval accuracy on synthetic tasks while using minimal additional computation for long-range dependencies.
Dynamic Gated Alternation (AMOR), also termed the Adaptive Metacognitive Output Router, is a dual-process sequence modeling architecture that dynamically allocates computation by integrating an efficient State Space Model (SSM) backbone with a sparse attention mechanism. The selection between these computational paths is governed by an entropy-based gating signal, yielding a metacognitive system that routes tokens to more expensive processing only when model uncertainty is high. AMOR is designed to address the inefficiencies of standard transformers, which apply uniform computation at all positions, and mitigate the retrieval limitations of SSMs over long-range dependencies by leveraging information-theoretic criteria for adaptive attention allocation (Zheng, 22 Jan 2026).
1. Architectural Foundations
AMOR operationalizes the principle of dual-process cognition by partitioning sequence processing into two interacting systems:
- System 1 (SSM Backbone): Processes each token sequentially through an SSM (such as S4 or GRU), producing hidden states $h_t$ and logits $\ell_t$ at each time step $t$. This backbone operates in $\mathcal{O}(T)$ time with respect to sequence length $T$, yielding stateful representations with low per-token computational cost.
- System 2 (Sparse Attention): Invoked conditionally, only when System 1 expresses high predictive uncertainty, as measured by entropy. Rather than recomputing transformer-style key and value projections, AMOR employs "Ghost KV": projecting $K$ and $V$ directly from the SSM's entire sequence of hidden states in $\mathcal{O}(T)$ time. At selected positions, the attention mechanism forms queries as $q_t = W_Q h_t$ and computes sparse top-$k$ attention over the key-value pairs.
This hybrid architecture thus amortizes expensive attention across only a subset of positions, retaining SSM efficiency while enabling high-precision retrieval.
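As an illustration of the Ghost KV construction, here is a minimal NumPy sketch; the function name and exact projection shapes are assumptions for exposition, not taken from the paper:

```python
import numpy as np

def ghost_kv(H, W_K, W_V):
    """Project keys and values directly from SSM hidden states.

    H:   (T, d) matrix of SSM hidden states, one row per position.
    W_K: (d, d) key projection; W_V: (d, d) value projection.
    Returns K and V of shape (T, d). Because the hidden states are
    reused, no transformer-style re-encoding of the sequence is needed.
    """
    K = H @ W_K
    V = H @ W_V
    return K, V
```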
2. Entropy-Based Gating Mechanism
Token-level routing to System 2 is determined by an entropy gate derived from the SSM's predictive uncertainty:
- Predictive Distribution: After SSM processing, the output logits $\ell_t$ at position $t$ define probabilities:

$$p_t = \mathrm{softmax}(\ell_t)$$

- Entropy Calculation: The unnormalized Shannon entropy is

$$H_t = -\sum_i p_{t,i} \log p_{t,i}$$

Normalized entropy, $\hat{H}_t = H_t / \log |V|$, allows consistent thresholding across different vocabularies.
- Gating Rule: A learnable threshold $\tau$ and sharpness parameter $\alpha$ govern the hard gate:

$$g_t = \mathbb{1}\!\left[\sigma\!\left(\alpha(\hat{H}_t - \tau)\right) > 0.5\right]$$

where $\sigma$ is the sigmoid function. Alternatively, a simple threshold may be used:

$$g_t = \mathbb{1}[\hat{H}_t > \tau]$$

When $g_t = 1$, System 2 attention is applied; otherwise, the model relies solely on the SSM output.
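A minimal NumPy sketch of this gate for a single position follows; the variable names are illustrative:

```python
import numpy as np

def entropy_gate(logits, tau, alpha=None):
    """Decide whether to route one position to System 2 attention.

    logits: (|V|,) SSM output logits at the position.
    tau:    learnable threshold; alpha: optional sharpness parameter.
    """
    p = np.exp(logits - logits.max())
    p /= p.sum()                            # softmax
    H = -np.sum(p * np.log(p + 1e-12))      # Shannon entropy in nats
    H_norm = H / np.log(len(p))             # normalize by log|V|
    if alpha is None:
        return H_norm > tau                 # simple threshold variant
    sigma = 1.0 / (1.0 + np.exp(-alpha * (H_norm - tau)))
    return sigma > 0.5                      # sigmoid gate, thresholded
```

Note that for any positive $\alpha$ the sigmoid form makes the same binary decision as the plain threshold; its role is to provide a smooth surrogate through which $\tau$ and $\alpha$ can be trained.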
3. Algorithmic Workflow
The forward computation in AMOR proceeds as follows:
```
Initialize h₀
for t in 1…T:
    e_t ← Embed(x_t)
    h_t ← SSM(e_t, h_{t-1})                      # System 1
    ℓ_t ← W_o · h_t
    p_t ← softmax(ℓ_t)
    H_t ← −∑_i p_{t,i} · log p_{t,i}
    Ĥ_t ← H_t / log|V|
    g_t ← 1 if sigmoid(α(Ĥ_t − τ)) > 0.5 else 0  # Entropy Gate
K ← W_K · [h₁; …; h_T]                            # Ghost KV
V ← W_V · [h₁; …; h_T]
for t in 1…T:
    if g_t == 1:                                  # System 2
        q_t ← W_Q · h_t
        scores ← (q_t Kᵀ) / √d
        sparse_scores ← top-k(scores, k=3)
        attn_t ← softmax(sparse_scores) · V
    else:
        attn_t ← 0
    h_combined_t ← W_c · [h_t ∥ attn_t]
    output_logits_t ← W_out · h_combined_t
```
The Ghost KV projections efficiently share SSM-derived temporal context. Sparse top-$k$ attention (typically $k = 3$) is selected to maximize computational savings without degrading retrieval accuracy.
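The System 2 step from the pseudocode can be rendered in NumPy as follows; the argpartition-based top-k selection is one possible implementation choice, not prescribed by the paper:

```python
import numpy as np

def system2_attention(q_t, K, V, k=3):
    """Sparse top-k attention for a single gated position.

    q_t:  (d,) query projected from the current hidden state.
    K, V: (T, d) Ghost KV matrices shared across all positions.
    Only the k highest-scoring positions receive attention weight.
    """
    d = q_t.shape[0]
    scores = K @ q_t / np.sqrt(d)            # (T,) similarity scores
    top = np.argpartition(scores, -k)[-k:]   # indices of the k largest
    w = np.exp(scores[top] - scores[top].max())
    w /= w.sum()                             # softmax over top-k only
    return w @ V[top]                        # (d,) attended context
```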
4. Computational Complexity and Efficiency
AMOR achieves substantial computational improvements relative to standard transformers:
- SSM Backbone: $\mathcal{O}(T)$, essentially linear in sequence length $T$.
- Transformer Attention: $\mathcal{O}(T^2)$ per layer.
- AMOR: Reuses SSM hidden states for Ghost KV construction, and computes attention only at gate-selected positions. If the gate activates at rate $r$:

$$\mathrm{Cost}_{\mathrm{AMOR}} = \mathcal{O}(T) + \mathcal{O}(r\,T^2)$$

so the attention term shrinks in direct proportion to $r$.
Empirical gate rates ($r \approx 0.22$ on synthetic tasks) yield approximately a 78% reduction in attention computation relative to transformer baselines.
| Model | Retrieval Acc (%) | Gate Rate (%) | Attention Reduction (%) |
|---|---|---|---|
| SSM only | 68.4 | 0 | 100 |
| Transformer | 87.3 | 100 | 0 |
| AMOR (entropy) | 100 | 22.3 | 78 |
Perfect retrieval is attained with a minority of positions routed to attention.
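The reduction column follows directly from the gate rate, since attention work scales with the fraction of gated positions; a quick check using the table's values:

```python
gate_rate = 0.223                      # fraction of positions routed to System 2
attention_reduction = 1 - gate_rate    # attention cost saved vs. full attention
print(f"{attention_reduction:.0%}")    # -> 78%
```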
5. Empirical Evaluation and Benchmarks
AMOR is validated on synthetic long-range retrieval benchmarks:
- Simple Retrieval Task: Sequences of length 128, combining routine bigram dependencies with intermittent explicit retrieval points (~3% of positions). Metrics include overall next-token accuracy, retrieval accuracy, gate rate, and the entropy gap between retrieval and local positions. AMOR yields 100% retrieval accuracy with only 22.3% gate activation and an entropy gap of 1.09 nats, indicating the effectiveness of entropy as a retrieval signal.
- NeedleHaystack Task: Insertion of 2–5 key–value pairs followed by 50–150 noise tokens induces SSM state decay; retrieval is then probed at distal positions. With oracle gating, AMOR reaches 37.1% retrieval accuracy (vs. 4.4% for transformers); entropy-based gating achieves 9.9% accuracy (gate rate 81%), illustrating the limits of SSM state retention and the value of attention augmentation.
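For concreteness, here is a sketch of a NeedleHaystack-style generator consistent with the description above; the token layout and vocabulary handling are assumptions rather than the paper's exact construction:

```python
import random

def make_needle_haystack(n_pairs=3, n_noise=100, vocab_size=1000):
    """One synthetic sequence: key-value pairs, noise filler, then a probe.

    Returns the token sequence and the value the model should produce
    immediately after the probe key, which is how retrieval accuracy
    is scored.
    """
    keys = random.sample(range(vocab_size), n_pairs)      # distinct keys
    pairs = {k: random.randrange(vocab_size) for k in keys}
    seq = [tok for kv in pairs.items() for tok in kv]     # k1 v1 k2 v2 ...
    seq += [random.randrange(vocab_size) for _ in range(n_noise)]
    probe = random.choice(keys)
    seq.append(probe)
    return seq, pairs[probe]
```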
Ablation studies demonstrate that Ghost KV projections from SSM states (36.3% retrieval) vastly outperform raw embedding-based attention (6.1%), and that sparser attention ($k = 3$) enhances retrieval focus compared to denser alternatives.
6. Interpretability and Theoretical Insights
The entropy gating mechanism in AMOR offers principled, interpretable adaptive computation:
- Metacognitive Routing: Entropy, grounded in information theory, quantifies the SSM's uncertainty and serves as a reliable signal for invoking auxiliary processing. Positions requiring retrieval consistently show an entropy gap of 1.09 nats, nearly half the normalized entropy range, distinguishing them from routine, low-uncertainty predictions.
- Interpretable Computation: The gating process is transparent and admits analytic understanding. When false positives occur (positions erroneously routed to attention), they introduce additional compute without adversely impacting correctness.
- Systemic View: AMOR instantiates dual-process computation where an $\mathcal{O}(T)$ backbone handles standard context, and a lightweight, information-theoretically justified router triggers attention only on demand. This design confers both efficiency gains and rigorously justified decision boundaries for adaptive allocation (Zheng, 22 Jan 2026).
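As a closing illustration of the entropy-gap diagnostic discussed above, a minimal sketch; the entropy and mask arrays are assumed inputs:

```python
import numpy as np

def entropy_gap(entropies, is_retrieval):
    """Mean entropy at retrieval positions minus mean entropy elsewhere.

    entropies:    (T,) per-position entropies from the SSM (nats).
    is_retrieval: (T,) boolean mask of positions requiring retrieval.
    A large positive gap (1.09 nats reported for AMOR) means the gate
    can separate retrieval positions from routine predictions.
    """
    e = np.asarray(entropies, dtype=float)
    m = np.asarray(is_retrieval, dtype=bool)
    return e[m].mean() - e[~m].mean()
```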