Sliding Chunk Attention (SCA)
- Sliding Chunk Attention (SCA) is an efficient attention mechanism that partitions sequences into overlapping chunks to balance local context and computational efficiency.
- It minimizes boundary artifacts by overlapping chunks, ensuring smooth information flow and improved performance in long-context modeling.
- SCA underpins architectures such as Gecko for language and extends to speech and vision, offering scalable GPU-friendly operations with reduced memory costs.
Sliding Chunk Attention (SCA) encompasses a class of efficient sparse attention mechanisms that partition input data into fixed-size segments ("chunks") and restrict each attention operation to a localized, potentially overlapping window. SCA is designed to maintain both computational and memory efficiency while mitigating the boundary artifacts and limited receptive fields characteristic of non-overlapping or strictly local attention schemes. The paradigm has been instantiated in LLMs, streaming sequence transducers, and computer vision architectures. Notably, SCA underpins the Gecko architecture, which achieves robust, high-fidelity long-context modeling without context-extension tricks and supports parallel computation friendly to modern accelerator hardware (Ma et al., 10 Jan 2026).
1. Foundational Algorithm and Mathematical Formalism
Sliding Chunk Attention operates by dividing a sequence $X \in \mathbb{R}^{L \times d}$ into non-overlapping chunks $X_1, \dots, X_N$ of length $C$. For chunk index $i$, the model forms queries, keys, and values $Q_i$, $K_i$, and $V_i$, respectively, via suitable projections (including normalization) of $X_i$. For each chunk $i$, the attention operation is restricted to the current and immediately preceding chunk:

$$O_i = \mathrm{softmax}\!\left(\frac{Q_i\,[K_{i-1};\,K_i]^\top}{\sqrt{d}}\right)[V_{i-1};\,V_i],$$

where $[\cdot\,;\cdot]$ denotes concatenation along the sequence axis. Zero-padding is used for $i = 1$ (no previous chunk). The final sequence output $O$ is the concatenation of the per-chunk outputs $O_1, \dots, O_N$. Attention normalization relies on a standard row-wise softmax; upstream, query-key normalization as in Megalodon is employed (Ma et al., 10 Jan 2026).
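A minimal PyTorch sketch of this computation follows. The function name, single-head layout, and omission of causal masking and Megalodon-style query-key normalization are simplifications for illustration, not the Gecko implementation:

```python
import math
import torch
import torch.nn.functional as F

def sliding_chunk_attention(q, k, v, chunk_size):
    """Sliding-chunk attention over pre-projected q, k, v of shape
    (batch, seq_len, dim), with seq_len divisible by chunk_size."""
    B, L, D = q.shape
    N = L // chunk_size
    # Chunk-level views: (batch, num_chunks, chunk_size, dim).
    q = q.view(B, N, chunk_size, D)
    k = k.view(B, N, chunk_size, D)
    v = v.view(B, N, chunk_size, D)
    # Prepend each chunk's predecessor; the first chunk gets zero-padding.
    k_prev = torch.cat([torch.zeros_like(k[:, :1]), k[:, :-1]], dim=1)
    v_prev = torch.cat([torch.zeros_like(v[:, :1]), v[:, :-1]], dim=1)
    k_win = torch.cat([k_prev, k], dim=2)  # (B, N, 2*chunk_size, D)
    v_win = torch.cat([v_prev, v], dim=2)
    # One batched (C x 2C) matmul per chunk: contiguous, accelerator-friendly.
    scores = q @ k_win.transpose(-1, -2) / math.sqrt(D)
    # Keep probability mass off the zero-padded keys of the first chunk.
    scores[:, 0, :, :chunk_size] = float("-inf")
    out = F.softmax(scores, dim=-1) @ v_win  # (B, N, chunk_size, D)
    return out.reshape(B, L, D)
```

Every score matrix here is a $C \times 2C$ block, so all matrix multiplications stay chunk-local while boundary tokens still attend to the preceding chunk.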
2. Chunk Construction, Receptive Field, and Windowing
Chunking divides the input into contiguous, non-overlapping windows of length $C$. SCA introduces a one-chunk overlap in the attention window: each query in chunk $i$ attends to all tokens in chunks $i-1$ and $i$. This design ensures that tokens at chunk boundaries retain contextual access, a feature absent from naive chunk-wise attention and associated with "sawtooth" boundary artifacts in loss curves. In SCA, matrix multiplications are performed at the chunk level, supporting batched, contiguous, and accelerator-friendly operations. The sliding property minimizes context discontinuities and supports intra-sequence information flow across chunk boundaries (Ma et al., 10 Jan 2026).
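The window pattern can equivalently be expressed as a boolean attention mask. The illustrative helper below (the function name and tiny sizes are assumptions for demonstration) shows that a token just past a chunk boundary still sees the full preceding chunk:

```python
import torch

def sliding_chunk_mask(seq_len, chunk_size):
    """mask[q, k] is True iff key position k falls in the query's own chunk
    or the immediately preceding chunk."""
    chunk_id = torch.arange(seq_len) // chunk_size
    diff = chunk_id[:, None] - chunk_id[None, :]  # query chunk minus key chunk
    return (diff == 0) | (diff == 1)

mask = sliding_chunk_mask(seq_len=8, chunk_size=2)
print(mask.int())
# Position 2 (first token of chunk 1) attends to positions 0-3: the boundary
# token keeps its left context, unlike non-overlapping chunk-wise attention.
```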
3. Computational Complexity and Hardware Efficiency
Let $L$ be the sequence length, $C$ the chunk size, and $N = L/C$ the number of chunks. The per-chunk operations require $O(C^2 d)$ work: (i) $Q_i [K_{i-1}; K_i]^\top$ with cost $O(2C^2 d)$, and (ii) the weighted sum with $[V_{i-1}; V_i]$ with cost $O(2C^2 d)$. Over all $N$ chunks, the total cost is $O(LCd)$. Memory cost is $O(Ld)$ for token projections and $O(LC)$ for chunk-local attention matrices. This contrasts with the $O(L^2 d)$ time and $O(L^2)$ memory of full self-attention. Unlike Longformer-style sliding windows that require separate per-token computations, SCA's per-chunk batched operations are amenable to high-performance parallelization (Ma et al., 10 Jan 2026).
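As a back-of-the-envelope check of these bounds, the following snippet compares attention-score time and memory; only $L = 4$M reflects the Gecko training setting reported below, while the chunk size and head dimension are illustrative assumptions:

```python
L, C, d = 4_000_000, 4_096, 128  # seq length; chunk size and head dim illustrative

full_time = L * L * d      # O(L^2 d): full self-attention score computation
sca_time  = L * 2 * C * d  # O(L C d): each of L queries scores against 2C keys
full_mem  = L * L          # O(L^2): dense attention matrix
sca_mem   = L * 2 * C      # O(L C): chunk-local (C x 2C) score blocks

print(f"time:   full / SCA = {full_time / sca_time:.0f}x")  # ~488x
print(f"memory: full / SCA = {full_mem / sca_mem:.0f}x")    # ~488x
```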
4. Comparisons with Full and Sparse Attention Mechanisms
SCA achieves an intermediate regime between full self-attention and (non-overlapping) chunk-wise attention. Full self-attention offers unlimited receptive fields but quadratically scaling cost; chunk-wise attention is efficient but introduces severe boundary effects. SCA inherits the $O(LCd)$ scaling of chunk-wise attention but alleviates context loss by overlapping chunk windows. Longformer-style sliding window attention scales linearly in window size but is inefficient on GPUs/TPUs due to per-token attention kernel launches. SCA's design, in contrast, mitigates these inefficiencies while preserving global receptive field growth with depth. Empirically, SCA eliminates negative log-likelihood spikes observed at chunk boundaries in standard chunked attention (Ma et al., 10 Jan 2026).
| Attention Mechanism | Complexity | GPU Efficiency | Boundary Effects |
|---|---|---|---|
| Full (Global) | $O(L^2 d)$ | Poor | None |
| Chunk-wise | $O(LCd)$ | High | Severe |
| Sliding Window | $O(LWd)$ | Low | Moderate |
| Sliding Chunk (SCA) | $O(LCd)$ | High | Minimal |

Here $L$ is the sequence length, $C$ the chunk size, and $W$ the per-token sliding-window size.
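One way to see the receptive-field contrast in the table: with a one-chunk overlap, every additional layer extends a token's backward reach by one chunk, whereas non-overlapping chunk-wise attention never escapes its own chunk. A tiny sketch, with the function name and values assumed for illustration:

```python
def backward_receptive_field(num_layers, chunk_size, overlap_chunks=1):
    """Upper bound on how far back (in tokens) a token can gather information
    after stacking layers: depth k covers 1 + k * overlap_chunks chunks.
    overlap_chunks=0 models plain non-overlapping chunk-wise attention."""
    return (1 + num_layers * overlap_chunks) * chunk_size

print(backward_receptive_field(32, 4096))     # SCA, 32 layers -> 135168 tokens
print(backward_receptive_field(32, 4096, 0))  # chunk-wise     ->   4096 tokens
```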
5. Empirical Performance and Effects in Sequence Modeling
In large-scale pretraining (Gecko), SCA enables training on 4 million token sequences without context-extension techniques. As context length grows, perplexity continues to decrease, in contrast to the flattening observed in alternative architectures (e.g., Megalodon). Gecko with SCA achieves a training loss of 1.68, outperforming Llama2-7B (1.75) and Megalodon-7B (1.70) at a comparable parameter and token budget, and closely matching Llama2-13B (1.67). In long-context information retrieval, Gecko robustly retrieves key items from contexts substantially longer than its nominal attention window, achieving 100% passkey retrieval accuracy on needle-in-a-haystack tasks at context lengths well beyond that window (Ma et al., 10 Jan 2026).
6. Extensions to Other Modalities and Variants
SCA principles have been instantiated in speech-to-text streaming models and visual text recognition:
- Chunk-wise Attention Transducers (CHAT): process audio in fixed-size chunks, employing local cross-attention within each chunk. The joiner performs cross-attention over each chunk and an appended blank frame, reducing peak training memory by up to 46.2%, accelerating training and inference (1.36× and 1.69×, respectively), and yielding significant reductions in word error rate (up to 6.3%) as well as BLEU improvements for streaming speech translation (up to 18.0%) (Xu et al., 27 Feb 2026).
- Sliding Convolutional Attention Network (SCAN): In scene text recognition, sliding windows extract overlapping patches, which are processed by CNNs and 1D convolutional encoders. At each output position, attention is applied over the windowed features, analogous to SCA in the spatial domain—a design that supports full parallelism and interpretable attention distributions (Wu et al., 2018).
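To ground the spatial analogy, overlapping patches of the kind SCAN slides over can be extracted with a single `unfold` call. The window width and stride below are made-up values, and this is a sketch of the sliding-window idea rather than the SCAN pipeline itself:

```python
import torch

# Toy text-line "image": (batch, channels, height, width).
x = torch.randn(1, 3, 32, 128)

window, stride = 32, 8  # illustrative sliding-window width and step
patches = x.unfold(dimension=3, size=window, step=stride)
print(patches.shape)    # torch.Size([1, 3, 32, 13, 32]): 13 overlapping patches
# Each patch would then pass through a CNN encoder before per-position attention.
```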
7. Limitations, Trade-offs, and Future Directions
SCA trades limited global context per layer for substantial efficiency and parallelization. The one-chunk overlap offers near-local attention while alleviating sharp boundary losses, but global information integration still depends on stack depth. Chunk size selection exposes an accuracy-latency tradeoff: large chunks provide broader context with increased per-chunk memory and potential latency. SCA is best suited for regimes where hardware efficiency and long-sequence handling are critical, and can be further enhanced through stacking with global attention layers or hybrid memory components (as in LMs with adaptive working memory) (Ma et al., 10 Jan 2026, Xu et al., 27 Feb 2026).
For implementation details, pseudocode, and the precise core equations, as well as all empirical results cited, refer directly to the Gecko LLM repository and documentation (Ma et al., 10 Jan 2026).