Constant Memory Attention Block
- Constant Memory Attention Block is a mechanism that compresses arbitrarily large inputs into fixed-size representations, ensuring scalable context aggregation.
- It uses cross-attention and self-attention with learned bottleneck queries to maintain constant auxiliary memory use and support O(1)-time streaming updates.
- This approach has been applied successfully in meta-learning, video generation, and language modeling, achieving competitive performance with minimal memory overhead.
A Constant Memory Attention Block (CMAB) is an attention mechanism designed to enable large-scale context aggregation with provably constant auxiliary memory requirements, independent of the number of input tokens or context elements. Originally introduced in the domain of meta-learning for neural processes, CMABs and related constant-memory block attention structures have since expanded to diverse architectures and practical settings, including temporal point processes, efficient video generation, block-wise long-sequence language modeling, and photonic LLM accelerators. The unifying characteristic is the ability to compress arbitrarily large input sets into a fixed-size latent, memory, or state representation that supports constant-memory forward computation and O(1)-time streaming/dynamic updates.
1. Formal Definitions and Structural Properties
CMABs operate by factoring attention across (potentially large) input sets or sequences using fixed-size learned embeddings, latent slots, or external memory modules. In the canonical form (Feng et al., 2023, Feng et al., 2023), the block first compresses the $N$ input tokens $D$ into a fixed set of “bottleneck” queries $L_B$ (learned vectors) through cross-attention, followed by self-attention over the compressed result: $L_B' = \mathrm{SA}(\mathrm{CA}(L_B, D))$, with $\mathrm{CA}(\cdot, \cdot)$ denoting cross-attention (queries first, context second) and $\mathrm{SA}$ denoting self-attention. A fixed set of input latents $L_I$ then cross-attends to $L_B'$, and another self-attention yields the block’s output: $\mathrm{CMAB}(L_I, D) = \mathrm{SA}(\mathrm{CA}(L_I, L_B'))$. The memory cost beyond the raw input storage is dominated by fixed objects of sizes $|L_B|$ and $|L_I|$ ($O(1)$ in $N$), and chunked processing keeps all intermediate matrices bounded (Feng et al., 2023). Streaming updates are supported by keeping per-bottleneck accumulators for softmax normalizers and value-weighted sums, each updatable in $O(1)$ time per new datum.
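The data flow of the canonical block can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: learned query/key/value projections, residual connections, and feed-forward layers are omitted, the latent counts (4 and 3) are arbitrary, and the names `attend`, `cmab`, and `random_vectors` are hypothetical. It also evaluates the first cross-attention in one pass rather than in chunks, so it shows the shape of the computation rather than the constant-memory evaluation itself.

```python
import math
import random


def attend(queries, keys, values):
    """Dot-product attention: every query takes a softmax-weighted mean of values."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    outputs = []
    for q in queries:
        logits = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        top = max(logits)  # subtract the max before exponentiating
        weights = [math.exp(l - top) for l in logits]
        total = sum(weights)
        outputs.append([
            sum(w * v[c] for w, v in zip(weights, values)) / total
            for c in range(len(values[0]))
        ])
    return outputs


def cmab(input_latents, bottleneck_latents, data):
    """Data flow of one block: CA -> SA -> CA -> SA (projections omitted)."""
    compressed = attend(bottleneck_latents, data, data)      # CA(L_B, D)
    compressed = attend(compressed, compressed, compressed)  # SA over bottleneck
    out = attend(input_latents, compressed, compressed)      # CA(L_I, L_B')
    return attend(out, out, out)                             # final SA


def random_vectors(count, dim, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


rng = random.Random(0)
L_B = random_vectors(4, 8, rng)  # 4 bottleneck latents (size is an assumption)
L_I = random_vectors(3, 8, rng)  # 3 input latents (size is an assumption)
small = cmab(L_I, L_B, random_vectors(10, 8, rng))
large = cmab(L_I, L_B, random_vectors(2000, 8, rng))
print(len(small), len(large))  # both equal len(L_I), whatever the input size
```

Only the first `attend` call touches the input set; every later step operates on objects whose size is set by the latent counts, which is why the output shape is the same for 10 and 2000 inputs.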
Alternative structural paradigms include block-wise autoregressive linear attention with constant-memory key-value (KV) summaries (Chen et al., 29 Sep 2025), fixed-size external learnable memory “packing/unpacking” (Yorsh et al., 2024), and block self-attention hierarchies (Shen et al., 2018).
2. Computational Complexity and Memory Analysis
CMABs enforce strict upper bounds on peak workspace. For the canonical set attention block (Feng et al., 2023):
- Forward pass: $O(N)$ time and $O(1)$ active memory in the number of inputs $N$; all cross-attention with the context is executed in chunks or via rolling accumulators.
- Streaming update: Each new input requires only $O(1)$ work and storage, as only the accumulators are updated.
- For model stacks (e.g., in neural processes), memory does not grow with the size of context or target sets, as all further attentions are applied to constant-size latents.
Block-wise linear attention structures achieve constant-memory scaling by maintaining only block-wise compressed summaries. In SANA-Video, the cumulative kernelized key-value matrix $S = \sum_j \phi(k_j) v_j^\top$ and the key normalizer $z = \sum_j \phi(k_j)$, where $\phi$ is the kernel feature map, are sufficient; per-block updates and computations remain $O(1)$ regardless of total sequence/token count (Chen et al., 29 Sep 2025).
Memory-augmented approaches as in ConvLuna restrict the token-to-memory interaction to $O(NM)$ (for $N$ inputs and $M$ memory slots), where $M$ is fixed; thus each token's attention row, as well as the packed memory feature map, has a size set by $M$ and independent of $N$ (Yorsh et al., 2024). Block self-attention, as in Bi-BloSAN, splits into intra-block and inter-block attention, confining high-memory operations within blocks of size $r$ or across $m = n/r$ blocks for a length-$n$ sequence; memory becomes $O(nr + n^2/r^2)$, minimized for $r = \sqrt[3]{2n}$ (Shen et al., 2018).
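The block-size trade-off for block self-attention can be worked out with a short calculation. The sketch below assumes a sequence of length $n$ split into $m = n/r$ blocks of size $r$ (with $r$ dividing $n$), and counts only attention-score entries, ignoring the feature dimension and constant factors; the symbols $M(r)$ and $r^{*}$ are introduced here for illustration.

```latex
\begin{aligned}
M(r) &= \underbrace{m\,r^{2}}_{\text{intra-block}} + \underbrace{m^{2}}_{\text{inter-block}}
      = n r + \frac{n^{2}}{r^{2}}, \qquad m = \frac{n}{r},\\
\frac{dM}{dr} &= n - \frac{2 n^{2}}{r^{3}} = 0
      \;\Longrightarrow\; r^{*} = \sqrt[3]{2n},\\
M(r^{*}) &= 2^{1/3} n^{4/3} + 2^{-2/3} n^{4/3}
      = 3 \cdot 2^{-2/3}\, n^{4/3} \approx 1.89\, n^{4/3}.
\end{aligned}
```

For $n = 4096$ this gives $r^{*} \approx 20$ and roughly $1.2 \times 10^{5}$ score entries, against $n^{2} \approx 1.7 \times 10^{7}$ for full self-attention. The count is sub-quadratic but still grows with $n$, so this variant reduces memory rather than making it constant.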
3. Algorithms and Streaming Update Mechanisms
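The core computational primitive is an accumulator-based approach to attention. For the set-to-bottleneck cross-attention, the output for each bottleneck query $q_j$ is
$$o_j = \frac{u_j}{s_j}, \qquad s_j = \sum_{i=1}^{N} \exp\!\left(q_j^\top k_i / \sqrt{d}\right),$$
and $u_j = \sum_{i=1}^{N} \exp\!\left(q_j^\top k_i / \sqrt{d}\right) v_i$, where $k_i$ and $v_i$ are the key and value computed from input $x_i$ and $d$ is the key dimension. On streaming updates, arrival of $x_{N+1}$ allows immediate computation of the update for every $j$: $s_j \leftarrow s_j + \exp(q_j^\top k_{N+1} / \sqrt{d})$ and $u_j \leftarrow u_j + \exp(q_j^\top k_{N+1} / \sqrt{d})\, v_{N+1}$, and correspondingly the output $o_j = u_j / s_j$ can be updated with simple arithmetic (Feng et al., 2023, Feng et al., 2023). These formulas can be batched or stabilized in log space with no loss of the O(1)-per-step guarantee.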
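The accumulator update described above can be illustrated with a small stdlib-only sketch. The class name `StreamingCrossAttention` and the helper `batch_cross_attention` are hypothetical, keys and values are passed in directly rather than produced by learned projections, and the exponentials are deliberately left unstabilized to keep the arithmetic visible, so large logits would overflow in this form.

```python
import math
import random


class StreamingCrossAttention:
    """Running softmax cross-attention for a fixed set of bottleneck queries."""

    def __init__(self, queries):
        self.queries = queries
        self.scale = 1.0 / math.sqrt(len(queries[0]))
        self.norm = [0.0] * len(queries)  # softmax normalizer per query
        self.wsum = None                  # value-weighted sum per query

    def update(self, key, value):
        """Fold one new (key, value) pair into the accumulators."""
        if self.wsum is None:
            self.wsum = [[0.0] * len(value) for _ in self.queries]
        for j, q in enumerate(self.queries):
            w = math.exp(self.scale * sum(a * b for a, b in zip(q, key)))
            self.norm[j] += w
            for c, vc in enumerate(value):
                self.wsum[j][c] += w * vc

    def output(self):
        return [[u / s for u in row] for row, s in zip(self.wsum, self.norm)]


def batch_cross_attention(queries, keys, values):
    """Reference: recompute softmax attention over the whole set at once."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        w = [math.exp(scale * sum(a * b for a, b in zip(q, k))) for k in keys]
        total = sum(w)
        out.append([sum(wi * v[c] for wi, v in zip(w, values)) / total
                    for c in range(len(values[0]))])
    return out


rng = random.Random(1)
vec = lambda n: [rng.gauss(0.0, 1.0) for _ in range(n)]
queries = [vec(8) for _ in range(4)]
keys = [vec(8) for _ in range(50)]
values = [vec(5) for _ in range(50)]

stream = StreamingCrossAttention(queries)
for k, v in zip(keys, values):
    stream.update(k, v)

reference = batch_cross_attention(queries, keys, values)
max_gap = max(abs(a - b)
              for ra, rb in zip(stream.output(), reference)
              for a, b in zip(ra, rb))
print(max_gap)  # difference between streaming and batch results
```

Each call to `update` costs work proportional to the number of queries times the vector dimensions and stores nothing about earlier inputs, so both time per step and state size are independent of how many items have been seen.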
For blockwise linear attention, constant memory is enforced via kernel tricks: all of the history up to the current block is compressed into a key-value summary $S$ and a key normalizer $z$; each block processes its local tokens, then merges its own summary into the cache (Chen et al., 29 Sep 2025). Lightning Attention (Qin et al., 2024) employs intra-block attention in the conventional masked form (the mask remains block-local, $B \times B$ for block size $B$) and inter-block kernel-accumulated updates, guaranteeing that no cumulative summation buffer ever scales with the sequence length $n$.
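A generic version of this block-wise scheme can be sketched as follows. This is an illustration of causal linear attention with a fixed-size state, not the SANA-Video or Lightning Attention kernels: the feature map `phi` (elu + 1) is an assumed choice, normalization by the key sum is one common convention, and all function names are hypothetical. History enters only through the state `S` and `z`, while tokens inside the current block are handled with an explicit causal loop.

```python
import math
import random


def phi(x):
    """Positive feature map elu(x) + 1 (an assumed choice of kernel)."""
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]


def blockwise_linear_attention(qs, ks, vs, block_size):
    d, dv = len(qs[0]), len(vs[0])
    S = [[0.0] * dv for _ in range(d)]  # sum of phi(k) v^T over past blocks
    z = [0.0] * d                       # sum of phi(k) over past blocks
    outputs = []
    for start in range(0, len(qs), block_size):
        fq = [phi(q) for q in qs[start:start + block_size]]
        fk = [phi(k) for k in ks[start:start + block_size]]
        vb = vs[start:start + block_size]
        for t, q in enumerate(fq):
            # Inter-block part: read the compressed history from the state.
            num = [sum(q[a] * S[a][c] for a in range(d)) for c in range(dv)]
            den = sum(q[a] * z[a] for a in range(d))
            # Intra-block part: causal attention over positions <= t.
            for i in range(t + 1):
                w = sum(a * b for a, b in zip(q, fk[i]))
                den += w
                for c in range(dv):
                    num[c] += w * vb[i][c]
            outputs.append([x / den for x in num])
        # Merge this block's summary into the fixed-size state.
        for k, v in zip(fk, vb):
            for a in range(d):
                z[a] += k[a]
                for c in range(dv):
                    S[a][c] += k[a] * v[c]
    return outputs


def naive_causal_linear_attention(qs, ks, vs):
    """Reference that revisits the full history at every position."""
    out = []
    for t, q in enumerate(qs):
        fq = phi(q)
        w = [sum(a * b for a, b in zip(fq, phi(k))) for k in ks[:t + 1]]
        total = sum(w)
        out.append([sum(wi * v[c] for wi, v in zip(w, vs)) / total
                    for c in range(len(vs[0]))])
    return out


rng = random.Random(2)
vec = lambda n: [rng.gauss(0.0, 1.0) for _ in range(n)]
qs, ks, vs = ([vec(4) for _ in range(10)] for _ in range(3))
reference = naive_causal_linear_attention(qs, ks, vs)
gaps = {
    b: max(abs(x - y)
           for ra, rb in zip(blockwise_linear_attention(qs, ks, vs, b), reference)
           for x, y in zip(ra, rb))
    for b in (1, 4, 7)
}
print(gaps)  # per-block-size difference from the reference
```

The state holds `d * dv + d` numbers however long the sequence is; only the block-local work depends on `block_size`.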
Photonic block-selection as in PRISM (Park et al., 23 Mar 2026) eschews memory-bound electronic searching for O(1) “broadcast-and-weight” inner-product computation: all block similarities to a query are evaluated simultaneously in hardware, so (electronic) memory usage for KV selection is strictly bounded regardless of total context length.
4. Applications and Benchmarks
Meta-learning and Neural Processes
CMABs have been integrated into Neural Processes (specifically in Constant Memory Attentive Neural Processes, CMANPs), Temporal Point Processes (CMHP), and general set-based regression, producing memory usage that stays flat as the context size $N$ grows and outperforming or matching prior approaches on image completion (CelebA, EMNIST) and event prediction tasks (Feng et al., 2023, Feng et al., 2023). For example, on CelebA128, CMANP achieved 5.55±0.01 log-likelihood with 71MB GPU memory, while all quadratic- and linear-memory baselines ran out of memory (OOM) at this scale.
Video Generation
In SANA-Video, block linear diffusion with a constant-memory KV cache enables efficient autoregressive video synthesis with minute-scale temporal horizons, O(1) memory, and real-time throughput on commodity RTX 5090 GPUs (Chen et al., 29 Sep 2025). Compared to state-of-the-art models (e.g., Wan 2.1/SkyReel-V2-1.3B), SANA-Video achieves 16× lower latency and maintains competitive generation quality.
Language Modeling and Blockwise LLM
Lightning Attention (Qin et al., 2024) demonstrates constant SRAM memory for all block-wise linear attention at any sequence length, with measured GPU memory at n=64K fixed at 10GB (compared to 160GB for FlashAttention softmax), with little loss in token-level throughput or downstream perplexity. Photonic KV block-selection architectures (Park et al., 23 Mar 2026) reduce memory traffic by >16× and energy by 3–4 orders of magnitude relative to conventional GPU implementations at large context lengths (up to 128K tokens).
Block Attention for Efficient Sequence Models
Bi-BloSAN (Shen et al., 2018) shows high-accuracy, memory-efficient sequence encoding: intra-block and inter-block self-attention constructions sustain task performance on standard NLP benchmarks with drastically reduced memory and computation compared to full self-attention.
Empirical results across these domains consistently demonstrate that constant-memory block attention mechanisms enable tractable scaling to large input sets, outperform or match higher-memory baselines, and are particularly well-suited for resource-constrained or latency-critical deployments (Feng et al., 2023, Chen et al., 29 Sep 2025, Qin et al., 2024).
5. Design Choices, Limitations, and Trade-offs
Key design choices include bottleneck/slot count (in CMAB), block size (in blockwise structures), and the filtering operator for input compression (kernel pooling or convolution in ConvLuna (Yorsh et al., 2024)). Increasing bottleneck size improves representational power but increases per-block memory cost. Filtering (e.g., max-pool, Conv1D) before memory writes prevents memory collapse and keeps prototype utilization—even 8 outperforms a full Transformer on some long-range tasks (Yorsh et al., 2024).
Streaming updates rely on numerical stability: log-sum-exp machinery avoids catastrophic overflow, but over very long runs floating-point drift can accumulate, so periodic batch recomputation may be necessary (Feng et al., 2023). CMABs, like all compressed-set techniques, can lose fine-grained pairwise dependencies; hierarchical or composite mixtures could alleviate this at fixed cost.
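The log-sum-exp stabilization mentioned above can be sketched with a running maximum that rescales the accumulators whenever a larger logit arrives. This is a generic online-softmax pattern rather than code from the cited work; the class name `StableAccumulator` is hypothetical, and it tracks a single query for brevity. It addresses overflow only and does not remove the slow accumulation of rounding error over very long streams.

```python
import math


class StableAccumulator:
    """Online softmax-weighted mean for one query, stabilized by a running max."""

    def __init__(self, value_dim):
        self.max_logit = -math.inf
        self.norm = 0.0                 # sum of exp(logit - max_logit)
        self.wsum = [0.0] * value_dim   # sum of exp(logit - max_logit) * value

    def update(self, logit, value):
        new_max = max(self.max_logit, logit)
        # Rescale old sums to the new reference; exp(-inf) is 0.0 on first use.
        rescale = math.exp(self.max_logit - new_max)
        w = math.exp(logit - new_max)
        self.norm = self.norm * rescale + w
        self.wsum = [u * rescale + w * v for u, v in zip(self.wsum, value)]
        self.max_logit = new_max

    def output(self):
        return [u / self.norm for u in self.wsum]


# Logits this large overflow a direct exp in double precision.
try:
    math.exp(1000.0)
    naive_overflows = False
except OverflowError:
    naive_overflows = True

acc = StableAccumulator(value_dim=1)
for logit, value in [(1000.0, [1.0]), (1001.0, [2.0]), (999.0, [3.0])]:
    acc.update(logit, value)

# Softmax weights depend only on logit differences: exp(0), exp(1), exp(-1).
e = math.e
expected = (1.0 + 2.0 * e + 3.0 / e) / (1.0 + e + 1.0 / e)
print(naive_overflows, abs(acc.output()[0] - expected))
```

Every exponent passed to `math.exp` is at most zero, so the update cannot overflow, and the per-step cost stays constant.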
Hardware constraints for photonic selection (PRISM) include limited microring (MRR) count and wavelength-division multiplexing channel bandwidth, though practical multi-chip and time-multiplexed configurations suffice for current model sizes (Park et al., 23 Mar 2026).
Limitations include:
- Not all global dependencies or subtle input interactions are preserved in the compressive representation; tasks requiring full fidelity may see information loss.
- For blockwise methods, block size and stride impact both local/global trade-off and hardware efficiency.
- Pure residual memory updates risk slot collapse or under-utilization; exploration of gated or selective update mechanisms is ongoing (Yorsh et al., 2024).
6. Broader Trends, Variants, and Outlook
Constant memory attention blocks have generalized rapidly from meta-learning to efficient sequence modeling, long-context LLM inference, video generation, and hardware accelerators. Distinct instantiations include:
- Cross-attention with fixed-size latent queries (CMAB) (Feng et al., 2023)
- Hierarchical block self-attention (Bi-BloSAN) (Shen et al., 2018)
- Kernel-based block linear attention with constant-memory accumulators (SANA-Video, Lightning Attention) (Chen et al., 29 Sep 2025, Qin et al., 2024)
- External learnable shared memory with filtering and packing-unpacking (ConvLuna) (Yorsh et al., 2024)
- Photonic block-similarity engines for KV selection (PRISM) (Park et al., 23 Mar 2026)
Open directions include adaptive bottleneck sizing, integrating sparse/low-rank/factorized attention with constant-memory affordances, improving memory update rules (dynamic, gated, selective), and extending current block attention mechanisms to cross-modal and generative scenarios beyond classification and regression (Feng et al., 2023, Chen et al., 29 Sep 2025, Yorsh et al., 2024).
By compressing all history into fixed-size statistics and modularizing computation and updates blockwise or via external memory, constant-memory attention blocks enable scalable, efficient, and streaming-capable architectures for long-context and on-device applications, while sustaining competitive empirical performance across modalities and paradigms.