Dynamic Gated Alternation (AMOR)
- Dynamic Gated Alternation (AMOR) is a dual-process sequence model that integrates an efficient SSM backbone with sparse attention that activates only under high predictive entropy.
- It employs an entropy-based gating mechanism to selectively route tokens, achieving up to a 78% reduction in attention computation compared to standard transformers.
- Empirical evaluations reveal that AMOR attains perfect retrieval accuracy on synthetic tasks while using minimal additional computation for long-range dependencies.
Dynamic Gated Alternation (AMOR), also termed the Adaptive Metacognitive Output Router, is a dual-process sequence modeling architecture that dynamically allocates computation by integrating an efficient State Space Model (SSM) backbone with a sparse attention mechanism. The selection between these computational paths is governed by an entropy-based gating signal, yielding a metacognitive system that routes tokens to more expensive processing only when model uncertainty is high. AMOR is designed to address the inefficiencies of standard transformers, which apply uniform computation at all positions, and mitigate the retrieval limitations of SSMs over long-range dependencies by leveraging information-theoretic criteria for adaptive attention allocation (Zheng, 22 Jan 2026).
1. Architectural Foundations
AMOR operationalizes the principle of dual-process cognition by partitioning sequence processing into two interacting systems:
- System 1 (SSM Backbone): Processes each token sequentially through an SSM (such as S4 or GRU), producing hidden states $h_t$ and logits $\ell_t$ at each time step $t$. This backbone operates in $\mathcal{O}(T)$ time with respect to sequence length $T$, yielding stateful representations with low per-token computational cost.
- System 2 (Sparse Attention): Invoked conditionally, only when System 1 expresses high predictive uncertainty, as measured by entropy. Rather than recomputing transformer-style key and value projections, AMOR employs "Ghost KV": projecting $K$ and $V$ directly from the SSM's entire sequence of hidden states in $\mathcal{O}(T)$ time. At selected positions, the attention mechanism forms queries as $q_t = W_Q h_t$ and computes sparse top-$k$ attention over the key-value pairs.
This hybrid architecture thus amortizes expensive attention across only a subset of positions, retaining SSM efficiency while enabling high-precision retrieval.
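As an illustration of the Ghost KV construction, here is a minimal NumPy sketch; the function name and exact projection shapes are assumptions for exposition, not taken from the paper:

```python
import numpy as np

def ghost_kv(H, W_K, W_V):
    """Project keys and values directly from SSM hidden states.

    H:   (T, d) matrix of SSM hidden states, one row per position.
    W_K: (d, d) key projection; W_V: (d, d) value projection.
    Returns K and V of shape (T, d). Because the hidden states are
    reused, no transformer-style re-encoding of the sequence is needed.
    """
    K = H @ W_K
    V = H @ W_V
    return K, V
```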
2. Entropy-Based Gating Mechanism
Token-level routing to System 2 is determined by an entropy gate derived from the SSM's predictive uncertainty:
- Predictive Distribution: After SSM processing, the output logits $\ell_t$ at position $t$ define probabilities:

$$p_t = \mathrm{softmax}(\ell_t)$$

- Entropy Calculation: The unnormalized Shannon entropy is

$$H_t = -\sum_i p_{t,i} \log p_{t,i}$$

Normalized entropy, $\hat{H}_t = H_t / \log |V|$, allows consistent thresholding across different vocabularies.
- Gating Rule: A learnable threshold $\tau$ and sharpness parameter $\alpha$ govern the hard gate:

$$g_t = \mathbb{1}\!\left[\sigma\!\left(\alpha(\hat{H}_t - \tau)\right) > 0.5\right]$$

where $\sigma$ is the sigmoid function. Alternatively, a simple threshold may be used:

$$g_t = \mathbb{1}[\hat{H}_t > \tau]$$

When $g_t = 1$, System 2 attention is applied; otherwise, the model relies solely on the SSM output.
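A minimal NumPy sketch of this gate for a single position follows; the variable names are illustrative:

```python
import numpy as np

def entropy_gate(logits, tau, alpha=None):
    """Decide whether to route one position to System 2 attention.

    logits: (|V|,) SSM output logits at the position.
    tau:    learnable threshold; alpha: optional sharpness parameter.
    """
    p = np.exp(logits - logits.max())
    p /= p.sum()                            # softmax
    H = -np.sum(p * np.log(p + 1e-12))      # Shannon entropy in nats
    H_norm = H / np.log(len(p))             # normalize by log|V|
    if alpha is None:
        return H_norm > tau                 # simple threshold variant
    sigma = 1.0 / (1.0 + np.exp(-alpha * (H_norm - tau)))
    return sigma > 0.5                      # sigmoid gate, thresholded
```

Note that for any positive $\alpha$ the sigmoid form makes the same binary decision as the plain threshold; its role is to provide a smooth surrogate through which $\tau$ and $\alpha$ can be trained.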
3. Algorithmic Workflow
The forward computation in AMOR proceeds as follows:
```
Initialize h₀
for t in 1…T:
    e_t ← Embed(x_t)
    h_t ← SSM(e_t, h_{t-1})                      # System 1
    ℓ_t ← W_o · h_t
    p_t ← softmax(ℓ_t)
    H_t ← −∑_i p_{t,i} · log p_{t,i}
    Ĥ_t ← H_t / log|V|
    g_t ← 1 if sigmoid(α(Ĥ_t − τ)) > 0.5 else 0  # Entropy Gate
K ← W_K · [h₁; …; h_T]                            # Ghost KV
V ← W_V · [h₁; …; h_T]
for t in 1…T:
    if g_t == 1:                                  # System 2
        q_t ← W_Q · h_t
        scores ← (q_t Kᵀ) / √d
        sparse_scores ← top-k(scores, k=3)
        attn_t ← softmax(sparse_scores) · V
    else:
        attn_t ← 0
    h_combined_t ← W_c · [h_t ∥ attn_t]
    output_logits_t ← W_out · h_combined_t
```
The Ghost KV projections efficiently share SSM-derived temporal context. Sparse top-$k$ attention (typically $k = 3$) is selected to maximize computational savings without degrading retrieval accuracy.
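The System 2 step from the pseudocode can be rendered in NumPy as follows; the argpartition-based top-k selection is one possible implementation choice, not prescribed by the paper:

```python
import numpy as np

def system2_attention(q_t, K, V, k=3):
    """Sparse top-k attention for a single gated position.

    q_t:  (d,) query projected from the current hidden state.
    K, V: (T, d) Ghost KV matrices shared across all positions.
    Only the k highest-scoring positions receive attention weight.
    """
    d = q_t.shape[0]
    scores = K @ q_t / np.sqrt(d)            # (T,) similarity scores
    top = np.argpartition(scores, -k)[-k:]   # indices of the k largest
    w = np.exp(scores[top] - scores[top].max())
    w /= w.sum()                             # softmax over top-k only
    return w @ V[top]                        # (d,) attended context
```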
4. Computational Complexity and Efficiency
AMOR achieves substantial computational improvements relative to standard transformers:
- SSM Backbone: $\mathcal{O}(T)$, essentially linear in sequence length $T$.
- Transformer Attention: $\mathcal{O}(T^2)$ per layer.
- AMOR: Reuses SSM hidden states for Ghost KV construction, and computes attention only at gate-selected positions. If the gate activates at rate $r$:

$$\mathrm{Cost}_{\mathrm{AMOR}} = \mathcal{O}(T) + \mathcal{O}(r\,T^2)$$

so the attention term shrinks in direct proportion to $r$.
Empirical gate rates ($r \approx 0.22$ on synthetic tasks) yield approximately a 78% reduction in attention computation relative to transformer baselines.
| Model | Retrieval Acc (%) | Gate Rate (%) | Attention Reduction (%) |
|---|---|---|---|
| SSM only | 68.4 | 0 | 100 |
| Transformer | 87.3 | 100 | 0 |
| AMOR (entropy) | 100 | 22.3 | 78 |
Perfect retrieval is attained with a minority of positions routed to attention.
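The reduction column follows directly from the gate rate, since attention work scales with the fraction of gated positions; a quick check using the table's values:

```python
gate_rate = 0.223                      # fraction of positions routed to System 2
attention_reduction = 1 - gate_rate    # attention cost saved vs. full attention
print(f"{attention_reduction:.0%}")    # -> 78%
```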
5. Empirical Evaluation and Benchmarks
AMOR is validated on synthetic long-range retrieval benchmarks:
- Simple Retrieval Task: Sequences of length 128, combining routine bigram dependencies with intermittent explicit retrieval points (~3% of positions). Metrics include overall next-token accuracy, retrieval accuracy, gate rate, and the entropy gap between retrieval and local positions. AMOR yields 100% retrieval accuracy with only 22.3% gate activation and an entropy gap of 1.09 nats, indicating the effectiveness of entropy as a retrieval signal.
- NeedleHaystack Task: Insertion of 2–5 key–value pairs followed by 50–150 noise tokens induces SSM state decay; retrieval is then probed at distal positions. With oracle gating, AMOR reaches 37.1% retrieval accuracy (vs. 4.4% for transformers); entropy-based gating achieves 9.9% accuracy (gate rate 81%), illustrating the limits of SSM state retention and the value of attention augmentation.
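For concreteness, here is a sketch of a NeedleHaystack-style generator consistent with the description above; the token layout and vocabulary handling are assumptions rather than the paper's exact construction:

```python
import random

def make_needle_haystack(n_pairs=3, n_noise=100, vocab_size=1000):
    """One synthetic sequence: key-value pairs, noise filler, then a probe.

    Returns the token sequence and the value the model should produce
    immediately after the probe key, which is how retrieval accuracy
    is scored.
    """
    keys = random.sample(range(vocab_size), n_pairs)      # distinct keys
    pairs = {k: random.randrange(vocab_size) for k in keys}
    seq = [tok for kv in pairs.items() for tok in kv]     # k1 v1 k2 v2 ...
    seq += [random.randrange(vocab_size) for _ in range(n_noise)]
    probe = random.choice(keys)
    seq.append(probe)
    return seq, pairs[probe]
```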
Ablation studies demonstrate that Ghost KV projections from SSM states (36.3% retrieval) vastly outperform raw embedding-based attention (6.1%), and that sparser attention ($k = 3$) enhances retrieval focus compared to denser alternatives.
6. Interpretability and Theoretical Insights
The entropy gating mechanism in AMOR offers principled, interpretable adaptive computation:
- Metacognitive Routing: Entropy, grounded in information theory, quantifies the SSM's uncertainty and serves as a reliable signal for invoking auxiliary processing. Positions requiring retrieval consistently show an entropy gap of 1.09 nats, nearly half the normalized entropy range, distinguishing them from routine, low-uncertainty predictions.
- Interpretable Computation: The gating process is transparent and admits analytic understanding. When false positives occur (positions erroneously routed to attention), they introduce additional compute without adversely impacting correctness.
- Systemic View: AMOR instantiates dual-process computation where an $\mathcal{O}(T)$ backbone handles standard context, and a lightweight, information-theoretically justified router triggers attention only on demand. This design confers both efficiency gains and rigorously justified decision boundaries for adaptive allocation (Zheng, 22 Jan 2026).
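As a closing illustration of the entropy-gap diagnostic discussed above, a minimal sketch; the entropy and mask arrays are assumed inputs:

```python
import numpy as np

def entropy_gap(entropies, is_retrieval):
    """Mean entropy at retrieval positions minus mean entropy elsewhere.

    entropies:    (T,) per-position entropies from the SSM (nats).
    is_retrieval: (T,) boolean mask of positions requiring retrieval.
    A large positive gap (1.09 nats reported for AMOR) means the gate
    can separate retrieval positions from routine predictions.
    """
    e = np.asarray(entropies, dtype=float)
    m = np.asarray(is_retrieval, dtype=bool)
    return e[m].mean() - e[~m].mean()
```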