Constant Memory Attention Block

Updated 5 April 2026
  • Constant Memory Attention Block is a mechanism that compresses arbitrarily large inputs into fixed-size representations, ensuring scalable context aggregation.
  • It uses cross-attention and self-attention with learned bottleneck queries to maintain constant auxiliary memory use and support O(1)-time streaming updates.
  • This approach has been applied successfully in meta-learning, video generation, and language modeling, achieving competitive performance with minimal memory overhead.

A Constant Memory Attention Block (CMAB) is an attention mechanism designed to enable large-scale context aggregation with provably constant auxiliary memory requirements, independent of the number of input tokens or context elements. Originally introduced in the domain of meta-learning for neural processes, CMABs and related constant-memory block attention structures have since expanded to diverse architectures and practical settings, including temporal point processes, efficient video generation, block-wise long-sequence language modeling, and photonic LLM accelerators. The unifying characteristic is the ability to compress arbitrarily large input sets into a fixed-size latent, memory, or state representation that supports constant-memory forward computation and O(1)-time streaming/dynamic updates.

1. Formal Definitions and Structural Properties

CMABs operate by factoring attention across (potentially large) input sets or sequences using fixed-size learned embeddings, latent slots, or external memory modules. In the canonical form (Feng et al., 2023, Feng et al., 2023), the block compresses $N$ input tokens $x_1, \ldots, x_N$ into $L_B$ “bottleneck” queries (learned vectors) through cross-attention:

$$C_B = \text{CA}(\text{BEMB}, \text{INPUT}),$$

with $\text{CA}$ denoting cross-attention. This is followed by self-attention over the compressed $C_B$, then cross-attention from a fixed $L_I \times d$ (input latent) set to $C_B$ and another self-attention, yielding the block’s output:

$$\mathrm{CMAB}(\mathrm{IEMB}, \mathrm{INPUT}) = \mathrm{SA}(\mathrm{CA}(\mathrm{IEMB}, \mathrm{SA}(\mathrm{CA}(\mathrm{BEMB}, \mathrm{INPUT})))).$$

The memory cost beyond the raw input storage is dominated by fixed objects of sizes $L_B$ and $L_I$ (constant in $N$), and chunked processing keeps all intermediate matrices bounded (Feng et al., 2023). Streaming updates are supported by keeping per-bottleneck accumulators for softmax normalizers and value-weighted sums, each updatable in O(1) with every new datum.
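The composition SA(CA(IEMB, SA(CA(BEMB, INPUT)))) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes single-head attention with identity projections (keys and values are the raw rows), and the names `cmab`, `bemb`, and `iemb` are chosen here for illustration. The point it shows is that the output shape depends only on $L_I$ and $d$, never on $N$.

```python
import numpy as np

def softmax(scores):
    # Subtract the row maximum before exponentiating for numerical stability.
    shifted = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)

def cross_attention(queries, context):
    # Simplification (assumption): no learned projections, single head.
    d = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(d)
    return softmax(scores) @ context

def self_attention(x):
    return cross_attention(x, x)

def cmab(iemb, bemb, inputs):
    c_b = cross_attention(bemb, inputs)   # (L_B, d): compress the N inputs
    c_b = self_attention(c_b)             # (L_B, d)
    c_i = cross_attention(iemb, c_b)      # (L_I, d): read out into input latents
    return self_attention(c_i)            # (L_I, d)

rng = np.random.default_rng(0)
d, L_B, L_I = 8, 4, 3
bemb = rng.normal(size=(L_B, d))   # stand-in for learned bottleneck queries
iemb = rng.normal(size=(L_I, d))   # stand-in for learned input latents

out_small = cmab(iemb, bemb, rng.normal(size=(10, d)))
out_large = cmab(iemb, bemb, rng.normal(size=(1000, d)))
print(out_small.shape, out_large.shape)
```

Both calls return an array of shape `(L_I, d)`, so anything stacked on top of the block sees a constant-size input regardless of how many tokens were compressed.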

Alternative structural paradigms include block-wise autoregressive linear attention with constant-memory key-value (KV) summaries (Chen et al., 29 Sep 2025), fixed-size external learnable memory “packing/unpacking” (Yorsh et al., 2024), and block self-attention hierarchies (Shen et al., 2018).

2. Computational Complexity and Memory Analysis

CMABs enforce strict upper bounds on peak workspace. For the canonical set attention block (Feng et al., 2023):

  • Forward pass: time linear in $N$ and active memory constant in $N$; all cross-attention with the context is executed in chunks or via rolled accumulators.
  • Streaming update: Each new input requires only O(1) work and storage, as only accumulators are updated.
  • For model stacks (e.g., in neural processes), memory does not grow with the size of context or target sets, as all further attentions are applied to constant-size latents.
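The chunked forward pass described in the bullets above can be sketched as follows. This is an illustrative sketch under stated assumptions (identity key/value projections, unstabilized exponentials, hypothetical function names), not the reference implementation. It accumulates a normalizer and a value-weighted sum chunk by chunk, so the largest score matrix ever materialized has `chunk_size` columns rather than `N`.

```python
import numpy as np

def full_cross_attention(queries, context):
    # Reference: materializes the full (L_B, N) score matrix.
    d = queries.shape[-1]
    weights = np.exp(queries @ context.T / np.sqrt(d))
    return (weights / weights.sum(axis=-1, keepdims=True)) @ context

def chunked_cross_attention(queries, context, chunk_size):
    L_B, d = queries.shape
    normalizer = np.zeros((L_B, 1))      # running softmax denominator
    weighted_sum = np.zeros((L_B, d))    # running value-weighted sum
    peak_columns = 0                     # widest score matrix seen so far
    for start in range(0, context.shape[0], chunk_size):
        chunk = context[start:start + chunk_size]
        weights = np.exp(queries @ chunk.T / np.sqrt(d))   # (L_B, <= chunk_size)
        normalizer += weights.sum(axis=-1, keepdims=True)
        weighted_sum += weights @ chunk
        peak_columns = max(peak_columns, weights.shape[1])
    return weighted_sum / normalizer, peak_columns

rng = np.random.default_rng(0)
queries = rng.normal(size=(4, 8))
context = rng.normal(size=(1000, 8))

chunked, peak_columns = chunked_cross_attention(queries, context, chunk_size=64)
full = full_cross_attention(queries, context)
print(np.allclose(chunked, full), peak_columns)
```

The chunk size trades peak workspace against the number of loop iterations; it does not change the result.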

Block-wise linear attention structures achieve true constant-memory scaling by maintaining only block-wise compressed summaries. In SANA-Video, a cumulative kernelized key-value matrix $\sum_i \phi(k_i) v_i^\top$ and a cumulative key sum $\sum_i \phi(k_i)$ are sufficient; per-block updates and computations remain constant in cost regardless of total sequence/token count (Chen et al., 29 Sep 2025).

Memory-augmented approaches as in ConvLuna restrict attention to token-to-memory interactions between the inputs and a fixed number of memory slots; because the slot count is fixed, peak attention matrix storage, as well as the total synthesized feature map, do not grow with the input length (Yorsh et al., 2024). Block self-attention, as in Bi-BloSAN, splits into intra-block and inter-block attention, confining high-memory operations within individual blocks or across the set of block-level representations; total memory then depends on the block size, which can be chosen to minimize it (Shen et al., 2018).
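To make the intra-block/inter-block split concrete, here is a simplified sketch. It is not Bi-BloSAN itself: it assumes unmasked single-head attention with identity projections and uses mean pooling for block summaries, and the function name is hypothetical. It also counts how many attention scores are materialized, compared with the $n^2$ scores of full self-attention.

```python
import numpy as np

def softmax(scores):
    shifted = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)

def block_self_attention(x, block_size):
    n, d = x.shape
    if n % block_size != 0:
        raise ValueError("sequence length must be a multiple of block_size")
    m = n // block_size
    blocks = x.reshape(m, block_size, d)

    # Intra-block attention: m independent (r, r) score matrices.
    intra_scores = blocks @ blocks.transpose(0, 2, 1) / np.sqrt(d)
    local = softmax(intra_scores) @ blocks                 # (m, r, d)

    # Block summaries (mean pooling is an assumption of this sketch).
    summaries = local.mean(axis=1)                         # (m, d)

    # Inter-block attention: one (m, m) score matrix over the summaries.
    inter_scores = summaries @ summaries.T / np.sqrt(d)
    global_context = softmax(inter_scores) @ summaries     # (m, d)

    # Broadcast each block's global context back to its tokens.
    out = local + global_context[:, None, :]
    n_scores = intra_scores.size + inter_scores.size
    return out.reshape(n, d), n_scores

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 16))
out, n_scores = block_self_attention(x, block_size=8)
print(out.shape, n_scores, x.shape[0] ** 2)
```

With 64 tokens in blocks of 8, the sketch materializes 8 × 64 intra-block scores plus 64 inter-block scores, i.e. 576 in total against 4096 for full self-attention.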

3. Algorithms and Streaming Update Mechanisms

The core computational primitive is an accumulator-based approach to attention. For the set-to-bottleneck cross-attention, write $q_j$ for the $j$-th bottleneck query and $k_i$, $v_i$ for the key and value derived from input $x_i$ (notation chosen here, with any scaling factor absorbed into the scores). Each bottleneck keeps a softmax normalizer and a value-weighted sum,

$$s_j = \sum_{i=1}^{N} \exp(q_j^\top k_i), \qquad u_j = \sum_{i=1}^{N} \exp(q_j^\top k_i)\, v_i,$$

and the $j$-th row of $C_B$ is $u_j / s_j$. On streaming updates, arrival of $x_{N+1}$ allows immediate computation of the update for every $j$: $s_j \leftarrow s_j + \exp(q_j^\top k_{N+1})$ and $u_j \leftarrow u_j + \exp(q_j^\top k_{N+1})\, v_{N+1}$, and correspondingly the output $u_j / s_j$ can be updated with simple arithmetic (Feng et al., 2023, Feng et al., 2023). These formulas can be batched or stabilized in log space with no loss of the O(1)-per-step guarantee.
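A per-datum streaming update with log-space stabilization might look like the sketch below. It is a sketch under assumptions, not the published algorithm: keys and values are taken to be the raw input (identity projections), scores are scaled by $1/\sqrt{d}$, and the class name `StreamingBottleneckAttention` is hypothetical. Each bottleneck keeps a running maximum score, a normalizer, and a value-weighted sum, and rescales the accumulators whenever the maximum increases.

```python
import numpy as np

class StreamingBottleneckAttention:
    def __init__(self, queries):
        self.queries = queries                       # (L_B, d)
        L_B, d = queries.shape
        self.running_max = np.full(L_B, -np.inf)     # max score seen per bottleneck
        self.normalizer = np.zeros(L_B)              # sum of exp(score - running_max)
        self.weighted_sum = np.zeros((L_B, d))       # sum of exp(score - running_max) * value

    def update(self, x_new):
        # Work and storage are independent of how many inputs came before.
        d = self.queries.shape[1]
        scores = self.queries @ x_new / np.sqrt(d)   # (L_B,)
        new_max = np.maximum(self.running_max, scores)
        rescale = np.exp(self.running_max - new_max) # equals 0 on the first update
        weight = np.exp(scores - new_max)
        self.normalizer = self.normalizer * rescale + weight
        self.weighted_sum = (self.weighted_sum * rescale[:, None]
                             + weight[:, None] * x_new[None, :])
        self.running_max = new_max

    def output(self):
        return self.weighted_sum / self.normalizer[:, None]

rng = np.random.default_rng(0)
queries = rng.normal(size=(4, 8))
data = rng.normal(size=(200, 8))

stream = StreamingBottleneckAttention(queries)
for x in data:
    stream.update(x)
streamed = stream.output()

# Batch reference: ordinary softmax cross-attention over all 200 inputs.
scores = queries @ data.T / np.sqrt(8)
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
batch = (weights / weights.sum(axis=1, keepdims=True)) @ data
print(np.allclose(streamed, batch))
```

Keeping the running maximum means every stored exponential is at most 1, so the accumulators do not overflow even when raw scores are large.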

For blockwise linear attention, constant memory is enforced via kernel tricks: all of the history up to the current block is compressed into a cumulative key-value summary and a cumulative key normalizer; each block processes its local tokens, then merges its own summary into the cache (Chen et al., 29 Sep 2025). Lightning Attention (Qin et al., 2024) employs intra-block softmax attention (masking stays confined to the block) and inter-block kernel-accumulated updates, guaranteeing that no cumulative summation buffer ever scales with the sequence length.
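The block-wise scheme can be illustrated with a generic causal linear attention sketch. This is not the SANA-Video or Lightning Attention kernel: it assumes an `elu(x) + 1` feature map, uses kernelized (not softmax) attention inside each block, and all function names are hypothetical. History is carried only in `kv_state` and `k_state`, whose sizes do not depend on the sequence length.

```python
import numpy as np

def feature_map(x):
    # elu(x) + 1: a positive feature map (an assumed choice for this sketch).
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def blockwise_linear_attention(q, k, v, block_size):
    n, d = q.shape
    d_v = v.shape[1]
    kv_state = np.zeros((d, d_v))   # running sum of phi(k_i) v_i^T over past blocks
    k_state = np.zeros(d)           # running sum of phi(k_i) over past blocks
    out = np.empty((n, d_v))
    for start in range(0, n, block_size):
        stop = min(start + block_size, n)
        qb = feature_map(q[start:stop])
        kb = feature_map(k[start:stop])
        vb = v[start:stop]
        local = np.tril(qb @ kb.T)                    # causal mask inside the block
        numer = qb @ kv_state + local @ vb            # history term + local term
        denom = qb @ k_state + local.sum(axis=1)
        out[start:stop] = numer / denom[:, None]
        kv_state += kb.T @ vb                         # merge block summary into cache
        k_state += kb.sum(axis=0)
    return out

def reference_causal_linear_attention(q, k, v):
    # Quadratic-memory reference with the same feature map.
    scores = np.tril(feature_map(q) @ feature_map(k).T)
    return (scores @ v) / scores.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(50, 8)) for _ in range(3))
blockwise = blockwise_linear_attention(q, k, v, block_size=8)
reference = reference_causal_linear_attention(q, k, v)
print(np.allclose(blockwise, reference))
```

The only buffers that persist across blocks are a `(d, d_v)` matrix and a length-`d` vector, which is what makes the cache size independent of how many tokens have been processed.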

Photonic block-selection as in PRISM (Park et al., 23 Mar 2026) eschews memory-bound electronic searching for O(1) “broadcast-and-weight” inner-product computation: all block similarities to a query are evaluated simultaneously in hardware, so (electronic) memory usage for KV selection is strictly bounded regardless of total context length.

4. Applications and Benchmarks

Meta-learning and Neural Processes

CMABs have been integrated into Neural Processes (specifically in Constant Memory Attentive Neural Processes, CMANPs), Temporal Point Processes (CMHP), and general set-based regression, keeping memory usage flat as the context size grows and matching or outperforming prior approaches on image completion (CelebA, EMNIST) and event prediction tasks (Feng et al., 2023, Feng et al., 2023). For example, on CelebA128, CMANP achieved a log-likelihood of 5.55±0.01 with minimal GPU memory, while all quadratic- and linear-memory baselines ran out of memory (OOM) at this scale.

Video Generation

In SANA-Video, block linear diffusion with a constant-memory KV cache enables efficient autoregressive video synthesis with minute-scale temporal horizons, O(1) memory, and real-time throughput on commodity RTX 5090 GPUs (Chen et al., 29 Sep 2025). Compared to state-of-the-art models (e.g., Wan 2.1/SkyReel-V2-1.3B), SANA-Video achieves 16× lower latency and maintains competitive generation quality.

Language Modeling and Blockwise LLM

Lightning Attention (Qin et al., 2024) demonstrates constant SRAM memory for all block-wise linear attention at any sequence length, with measured GPU memory at n=64K fixed at 10GB (compared to 160GB for FlashAttention softmax), with little loss in token-level throughput or downstream perplexity. Photonic KV block-selection architectures (Park et al., 23 Mar 2026) reduce memory traffic by >16× and energy by 3–4 orders of magnitude relative to conventional GPU implementations at large context lengths (up to 128K tokens).

Block Attention for Efficient Sequence Models

Bi-BloSAN (Shen et al., 2018) shows high-accuracy, memory-efficient sequence encoding: intra-block and inter-block self-attention constructions sustain task performance on standard NLP benchmarks with drastically reduced memory and computation compared to full self-attention.

Empirical results across these domains consistently demonstrate that constant-memory block attention mechanisms enable tractable scaling to large input sets, outperform or match higher-memory baselines, and are particularly well-suited for resource-constrained or latency-critical deployments (Feng et al., 2023, Chen et al., 29 Sep 2025, Qin et al., 2024).

5. Design Choices, Limitations, and Trade-offs

Key design choices include bottleneck/slot count (in CMAB), block size (in blockwise structures), and the filtering operator for input compression (kernel pooling or convolution in ConvLuna (Yorsh et al., 2024)). Increasing bottleneck size improves representational power but increases per-block memory cost. Filtering (e.g., max-pool, Conv1D) before memory writes prevents memory collapse and preserves prototype utilization; even a small memory outperforms a full Transformer on some long-range tasks (Yorsh et al., 2024).

Streaming updates rely on numerical stability: log-sum-exp machinery avoids catastrophic overflow, but in very long runs, drift can accumulate, so periodic batch recomputation may be necessary (Feng et al., 2023). CMABs, like all compressed-set techniques, can lose fine-grained pairwise dependencies—hierarchical or composite mixtures could potentially alleviate this at fixed cost.

Hardware constraints for photonic selection (PRISM) include limited microring (MRR) count and wavelength-division multiplexing channel bandwidth, though practical multi-chip and time-multiplexed configurations suffice for current model sizes (Park et al., 23 Mar 2026).

Limitations include:

  • Not all global dependencies or subtle input interactions are preserved in the compressive representation; tasks requiring full fidelity may see information loss.
  • For blockwise methods, block size and stride impact both local/global trade-off and hardware efficiency.
  • Pure residual memory updates risk slot collapse or under-utilization; exploration of gated or selective update mechanisms is ongoing (Yorsh et al., 2024).

Constant memory attention blocks have generalized rapidly from meta-learning to efficient sequence modeling, long-context LLM inference, video generation, and hardware accelerators.

Open directions include adaptive bottleneck sizing, integrating sparse/low-rank/factorized attention with constant-memory affordances, improving memory update rules (dynamic, gated, selective), and extending current block attention mechanisms to cross-modal and generative scenarios beyond classification and regression (Feng et al., 2023, Chen et al., 29 Sep 2025, Yorsh et al., 2024).

By compressing all history into fixed-size statistics and modularizing computation and updates blockwise or via external memory, constant-memory attention blocks enable scalable, efficient, and streaming-capable architectures for long-context and on-device applications, while sustaining competitive empirical performance across modalities and paradigms.
