ATMA: Length-Invariant Language Modeling via Polar Attention and Gated-Delta Compression Memory

Published 23 Jun 2026 in cs.LG and cs.AI | (2606.25156v1)

Abstract: Modern LLMs based on softmax scaled-dot-product attention are constrained by their training sequence length: as the key-value sequence grows, softmax probability mass can dilute across a wider distribution, inducing activation shift and long-context performance collapse. Moreover, long-context language modeling faces a structural tension: a sliding-window attention core maintains a bounded local representation and low perplexity but is blind to long-range dependencies, while full-context attention preserves global recall but suffers from out-of-distribution perplexity explosion. To resolve these limitations, we introduce ATMA, a hybrid convolutional-attention architecture that integrates a novel three-channel attention mechanism. ATMA factorizes the attention mixing step into: (1) a count-blind, unit-vector direction channel, (2) a bounded magnitude channel driven by the participation ratio of effective matches over an extreme-value-corrected null sink, and (3) a long-term recurrent compression memory optimized via a gated-delta fast-weights rule. Neither the Polar Attention core nor the recurrent memory is sufficient alone; their combination enables monotonic perplexity reduction and high-fidelity long-range retrieval simultaneously. We evaluate ATMA using a 100-run factorial ablation sweep, demonstrating that the combined Polar + memory model maintains induction needle-in-a-haystack retrieval accuracy above 90% out to 64K tokens (32 times the training length of 2K) while its document perplexity improves monotonically, outperforming softmax-based memory baselines which collapse at extreme context lengths. Code: https://github.com/kreasof-ai/atma


Authors (1)

Habibullah Akbar

Summary

The paper introduces ATMA, combining polar attention with gated-delta memory to maintain over 90% retrieval accuracy at extended context lengths.
It utilizes a hybrid 3:1 convolution-attention design with key normalization and optimized kernels to prevent softmax dilution and stabilize activations.
Empirical evaluation demonstrates that pairing memory with length-invariant attention gives reliable performance at 32× context extrapolation (2K training, 64K evaluation) with improved perplexity.

Length-Invariant Sequence Modeling with ATMA: Architecture, Mechanisms, and Empirical Analysis

Motivation and Problem Statement

The intrinsic limitations of softmax scaled-dot-product attention (SDPA) in long-sequence language modeling have become increasingly evident as practitioners push context lengths to tens of thousands of tokens. The primary phenomenon, termed “softmax dilution”, is the dispersion of probability mass as the number of key-value pairs grows. The largest softmax weight shrinks toward zero, which both flattens attention and shifts activations, causing severe degradation of retrieval accuracy and perplexity when extrapolating beyond the training sequence length. Neither pure local mixing (e.g., sliding-window attention) nor pure global mixing (unbounded attention or recurrent fast weights) resolves this trade-off on its own: sliding-window models cannot retrieve long-range information, while memory-augmented or linear-state approaches experience substantial representational drift and sharp collapses in retrieval fidelity.
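The dilution effect can be illustrated with a toy calculation. The sketch below is not taken from the paper: it assumes a single "match" key with a fixed logit (the hypothetical `match_logit`) surrounded by distractor keys whose logits are standard normal, and reports how much softmax weight the match retains as the number of keys grows.

```python
import numpy as np

def match_weight(n_keys, match_logit=4.0, seed=0):
    """Softmax weight on one planted match among n_keys - 1 random distractors.

    Assumption for illustration only: distractor logits are i.i.d. N(0, 1)
    and the match logit does not change with context length.
    """
    rng = np.random.default_rng(seed)
    logits = rng.standard_normal(n_keys)
    logits[0] = match_logit                    # the planted match
    weights = np.exp(logits - logits.max())    # numerically stable softmax
    weights /= weights.sum()
    return weights[0]

for n in (2_048, 8_192, 65_536):
    print(f"{n:>6} keys -> weight on the match: {match_weight(n):.5f}")
```

Because the denominator grows roughly linearly with the number of distractors while the numerator stays fixed, the weight on the match keeps shrinking as the context is extended.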

ATMA Model Architecture

ATMA is a hybrid sequence model that combines gated convolutions and attention with a memory-augmented channel. The architecture integrates 12 LFM2 gated convolutional layers and 4 Polar Attention layers within a 16-layer decoder-only transformer backbone.

Key aspects of the architecture:

Hybrid Mixer Design: A 3:1 ratio (convolution:attention) allows global context mixing while maintaining computational efficiency and local sequence coherence. Pre-normalization (RMSNorm-based) is consistently applied before all sequence-mixing and MLP operations.
MLP Design: Employs squared-ReLU activations with 4× hidden size expansion for parameter efficiency.
Grouped-Query Attention (GQA): Enhances throughput and cache efficiency at large batch sizes.
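As a rough illustration of the layout described above, the following sketch builds a 16-layer schedule with a 3:1 convolution-to-attention ratio and a squared-ReLU MLP with 4× expansion. The placement of the attention layers (every fourth layer) and the names `layer_schedule` and `squared_relu_mlp` are assumptions; the source gives only the layer counts and the activation.

```python
import numpy as np

def layer_schedule(n_layers=16, conv_per_attn=3):
    """Return a list of mixer types with conv_per_attn conv layers per attention layer.

    Assumption: attention is placed at every (conv_per_attn + 1)-th layer.
    """
    period = conv_per_attn + 1
    return ["attn" if (i + 1) % period == 0 else "conv" for i in range(n_layers)]

def squared_relu_mlp(x, w_in, w_out):
    """MLP with squared-ReLU activation: relu(x @ w_in) ** 2 @ w_out."""
    hidden = np.maximum(x @ w_in, 0.0) ** 2
    return hidden @ w_out

schedule = layer_schedule()
print(schedule.count("conv"), "conv layers,", schedule.count("attn"), "attention layers")

# Toy dimensions: model width 8, hidden width 4 x 8 = 32.
rng = np.random.default_rng(0)
d_model = 8
x = rng.standard_normal((2, d_model))
w_in = rng.standard_normal((d_model, 4 * d_model)) / np.sqrt(d_model)
w_out = rng.standard_normal((4 * d_model, d_model)) / np.sqrt(4 * d_model)
y = squared_relu_mlp(x, w_in, w_out)
```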

Polar Attention: Mechanistic Innovations

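Polar Attention replaces the standard softmax mixing step with separate direction and magnitude channels, supported by a length-aware null sink and an optional auxiliary loss (the recurrent memory, the third channel of the full model, is covered in the following section):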

Direction Channel ("What"): The value sum, after softmax normalization (augmented with a trainable null sink), is projected to the unit sphere, creating a count-blind and strictly length-invariant direction vector (feature selection is decoupled from match count).
Magnitude Channel ("How much"): Participation ratio (inverse Simpson index) measures the "density" of attention matches and is passed through a saturating monotonic map, bounding the magnitude and guaranteeing statistical stability across context scales.
Null Sink and Length Temperature: The virtual null key’s logit and softmax temperature scale as a function of sequence length, adapting the attention mechanism to progressive context extension and controlling the impact of random noise.
Auxiliary Distractor Loss: Optionally calibrates the null floor, but ablation results indicate this loss is detrimental when memory is enabled, due to excessive null sharpness.
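The direction and magnitude channels can be sketched for a single query as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the null sink is modeled as one extra logit with no value vector, the participation ratio is computed over the real keys only, and the saturating map `pr / (pr + pr_scale)` and the parameters `null_logit`, `temperature`, and `pr_scale` are hypothetical choices.

```python
import numpy as np

def polar_attention_single_query(q, K, V, null_logit=0.0, temperature=1.0, pr_scale=4.0):
    """Toy polar attention for one query: unit direction times bounded magnitude."""
    d = q.shape[0]
    logits = (K @ q) / (np.sqrt(d) * temperature)
    logits = np.concatenate([logits, [null_logit]])   # append the null sink
    p = np.exp(logits - logits.max())
    p /= p.sum()
    p_real = p[:-1]                                   # weights on real keys only

    # Direction channel: weighted value sum projected to the unit sphere.
    mixed = p_real @ V
    direction = mixed / (np.linalg.norm(mixed) + 1e-8)

    # Magnitude channel: participation ratio (inverse Simpson index)
    # squashed by a saturating monotonic map into [0, 1).
    pr = p_real.sum() ** 2 / (np.sum(p_real ** 2) + 1e-12)
    magnitude = pr / (pr + pr_scale)

    return magnitude * direction, direction, pr, magnitude

rng = np.random.default_rng(0)
n_keys, d = 50, 16
q = rng.standard_normal(d)
K = rng.standard_normal((n_keys, d))
V = rng.standard_normal((n_keys, d))
out, direction, pr, magnitude = polar_attention_single_query(q, K, V)
```

In this sketch the output norm never exceeds 1 regardless of how many keys are present, which is the property the direction and magnitude factorization is meant to provide.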

Titans Gated-Delta Memory Integration

The memory channel is implemented as a per-head, recurrent associative memory inspired by the Titans fast-weight model. Key features include:

Gated-Delta Update Rule: Maintains memory stability via sequence-normalized keys/queries and a rank-1 update after each step, avoiding the exponential norm blow-up of standard Hebbian or unnormalized recurrence.
Recency Control: A per-step, data-dependent gating mechanism determines retention and write strengths.
Self-Inclusive Readout: Memory readout integrates with the residual stream, initialized for safety to act as an effective no-op at training start.

Empirical analysis demonstrates that key normalization is essential: omitting it leads to immediate divergence, while including it guarantees norm stability and preserves usable retrieval capacity over context extension.
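A minimal sketch of a gated-delta fast-weight step illustrates why key normalization matters. It uses the commonly cited form S_t = alpha_t * S_{t-1} (I - beta_t k_t k_t^T) + beta_t v_t k_t^T; whether ATMA uses exactly this parameterization is an assumption, as are the fixed `alpha` and `beta` values (the paper describes them as data-dependent gates).

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    return x / (np.linalg.norm(x) + eps)

def gated_delta_step(S, k, v, alpha, beta, normalize_key=True):
    """One rank-1 gated-delta update of a (d_v, d_k) memory matrix S.

    alpha: retention gate in (0, 1]; beta: write strength in [0, 1].
    """
    if normalize_key:
        k = l2_normalize(k)
    predicted = S @ k                                   # what memory currently returns for k
    S_erased = S - beta * np.outer(predicted, k)        # S (I - beta k k^T)
    return alpha * S_erased + beta * np.outer(v, k)     # decay, then write v at k

def read(S, q):
    return S @ l2_normalize(q)

def final_norm(normalize_key, steps=100, d=32, alpha=0.98, beta=0.5, seed=0):
    """Frobenius norm of the memory after writing random key/value pairs."""
    rng = np.random.default_rng(seed)
    S = np.zeros((d, d))
    for _ in range(steps):
        k, v = rng.standard_normal(d), rng.standard_normal(d)
        S = gated_delta_step(S, k, v, alpha, beta, normalize_key)
    return np.linalg.norm(S)

# A single full-strength write into an empty memory is read back exactly.
rng = np.random.default_rng(1)
k0, v0 = rng.standard_normal(32), rng.standard_normal(32)
S0 = gated_delta_step(np.zeros((32, 32)), k0, v0, alpha=1.0, beta=1.0)
recalled = read(S0, k0)

norm_with = final_norm(normalize_key=True)
norm_without = final_norm(normalize_key=False)
print(f"memory norm with key normalization:    {norm_with:.3e}")
print(f"memory norm without key normalization: {norm_without:.3e}")
```

With unit-norm keys the factor (I - beta k k^T) has spectral norm at most 1 for beta in [0, 1], so the state cannot grow faster than the writes allow. With unnormalized keys, beta times the squared key norm can exceed 2 and the same factor amplifies the state at every step.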

Systems and Kernel Engineering

A significant element of ATMA is the careful co-design of model architecture and system kernels:

Fused Triton Kernels: FlashAttention-style streaming kernels minimize memory overhead, enabling online computation of participation ratio in O(BLOCK) memory with exact backward-pass accumulation.
Paged Decode Optimization: Sequence-blocked KV cache loads permit group-wise streaming of keys/values, virtually eliminating unnecessary HBM traffic.
In-Place Memory Update: A custom Triton kernel ensures in-place, contiguous updates of the large per-sequence memory state, tripling update efficiency over naive gather/scatter approaches.
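The online participation-ratio computation mentioned above can be sketched in plain NumPy, in the style of FlashAttention's streaming softmax. The sketch assumes the participation ratio is (sum of weights)^2 / (sum of squared weights) over a single row of logits; the actual Triton kernels, the null-sink handling, and the backward pass are not reproduced here, and `participation_ratio_streaming` is a hypothetical name.

```python
import numpy as np

def participation_ratio_streaming(logits, block=128):
    """Compute 1 / sum(softmax(logits) ** 2) in one pass over fixed-size blocks.

    Keeps only three scalars of state: the running max m, the running sum of
    exp(x - m), and the running sum of exp(2 * (x - m)).
    """
    m, s1, s2 = -np.inf, 0.0, 0.0
    for start in range(0, len(logits), block):
        x = logits[start:start + block]
        m_new = max(m, x.max())
        rescale = np.exp(m - m_new)          # equals 0 for the first block
        s1 = s1 * rescale + np.exp(x - m_new).sum()
        s2 = s2 * rescale ** 2 + np.exp(2.0 * (x - m_new)).sum()
        m = m_new
    return s1 ** 2 / s2

def participation_ratio_reference(logits):
    """Materialize the full softmax and compute the inverse Simpson index."""
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return 1.0 / np.sum(p ** 2)

rng = np.random.default_rng(0)
row = rng.standard_normal(5_000) * 3.0
pr_stream = participation_ratio_streaming(row)
pr_ref = participation_ratio_reference(row)
```

Because the normalizer cancels in the ratio, the streaming version never needs the full attention row in memory, only the current block.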

Despite the recurrence dependencies, adding the memory channel costs only a 4.5 percentage-point drop in MFU (Model FLOPs Utilization) relative to Polar Attention alone.

Empirical Results and Ablation Study

A comprehensive 100-run factorial sweep evaluated 370M-parameter models trained on FineWeb-Edu (1B tokens), with context lengths from 2K (training) to 64K (evaluation, 32× extrapolation). Key evaluations included clean-document perplexity and induction needle-in-a-haystack (NIAH) retrieval.

Main empirical findings:

Full-context attention alone (softmax or polar) fails catastrophically beyond 4× context extension; neither retrieval nor perplexity generalizes.
Titans memory with softmax recovers perplexity but collapses in retrieval accuracy at extreme contexts due to representation drift.
ATMA (Polar+Titans) uniquely maintains >90% retrieval accuracy up to 64K tokens with monotonically decreasing perplexity (from 2.70 to 1.96 nats, outperforming softmax-based baselines).
Memory is mandatory for length extrapolation, but must be paired with a length-invariant attention core; neither component works in isolation.
Adding a sliding window or the distractor loss is deleterious in the presence of the memory channel: the former removes access to long-range targets and the latter over-sharpens the null sink, so both reduce attainable retrieval.

Theoretical and Practical Implications

Theoretical

ATMA's decomposition of sequence-mixing into orthogonal channels elegantly resolves the tension between global context recall and stable local representation in deep sequence models. The explicit statistical bounding of each channel limits out-of-distribution activation drift and catastrophic interference common to both full softmax and standard linear attention architectures. By demonstrating the necessity of joint attention-memory approaches for true length-invariance, the work reframes the retrieval-vs-perplexity debate in long-context sequence modeling.

Practical

The model is robust under 32× context extrapolation (trained at 2K, evaluated up to 64K), with minimal computational overhead and numerically stable inference. This makes it suitable for large-scale codebases, legal/financial document modeling, and any workload demanding precise, long-range information retrieval. The optimized kernels ensure throughput is not the primary bottleneck, highlighting the relevance of algorithmic-hardware co-design in real-world LLM deployments.

Future Directions

The results suggest new design axes for future LLMs—including the exploration of alternative memory update functions, enhanced write-path distractors for memory capacity, and more granular regularization strategies. The architecture provides a foundation for models that can reliably scale to even larger contexts or operate in continual learning scenarios where stable, non-forgetting representation is paramount.

Conclusion

ATMA addresses the fundamental limitations of softmax-based and windowed attention mechanisms by introducing a hybrid polar attention and gated-delta memory architecture. This combination ensures length-invariant sequence modeling, enabling both efficient local and high-fidelity global context integration. Empirical results robustly support the theoretical design, showing monotonic improvements in perplexity and flat retrieval accuracy far beyond the training context, at competitive hardware efficiency. This architecture marks a significant advance in practical and theoretically sound long-context language modeling (2606.25156).
