Sliding Chunk Attention (SCA)
- Sliding Chunk Attention (SCA) is an efficient attention mechanism that partitions sequences into overlapping chunks to balance local context and computational efficiency.
- It minimizes boundary artifacts by overlapping chunks, ensuring smooth information flow and improved performance in long-context modeling.
- SCA underpins architectures such as Gecko for language and extends to speech and vision, offering scalable GPU-friendly operations with reduced memory costs.
Sliding Chunk Attention (SCA) encompasses a class of efficient sparse attention mechanisms that partition input data into fixed-size segments ("chunks") and restrict each attention operation to a localized, potentially overlapping window. SCA is designed to maintain both computational and memory efficiency while mitigating the boundary artifacts and limited receptive fields characteristic of non-overlapping or strictly local attention schemes. The paradigm has been instantiated in LLMs, streaming sequence transducers, and computer vision architectures. Notably, SCA underpins the Gecko architecture, which achieves robust, high-fidelity long-context modeling without context-extension tricks and supports parallel computation friendly to modern accelerator hardware (Ma et al., 10 Jan 2026).
1. Foundational Algorithm and Mathematical Formalism
Sliding Chunk Attention operates by dividing a sequence $X \in \mathbb{R}^{L \times d}$ into non-overlapping chunks $X_1, \dots, X_N$ of length $C$. For chunk index $i$, the model forms queries, keys, and values $Q_i$, $K_i$, and $V_i$, respectively, via suitable projections (including normalization) of $X_i$. For each chunk $i$, the attention operation is restricted to the current and immediately preceding chunk:

$$O_i = \mathrm{softmax}\!\left(\frac{Q_i\,[K_{i-1};\,K_i]^\top}{\sqrt{d}}\right)[V_{i-1};\,V_i],$$

where $[\cdot\,;\cdot]$ denotes concatenation along the sequence axis. Zero-padding is used for $i = 1$ (no previous chunk). The final sequence output $O$ is the concatenation of the per-chunk outputs $O_1, \dots, O_N$. Attention normalization relies on a standard row-wise softmax; upstream, query-key normalization as in Megalodon is employed (Ma et al., 10 Jan 2026).
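A minimal PyTorch sketch of this computation follows. The function name, single-head layout, and omission of causal masking and Megalodon-style query-key normalization are simplifications for illustration, not the Gecko implementation:

```python
import math
import torch
import torch.nn.functional as F

def sliding_chunk_attention(q, k, v, chunk_size):
    """Sliding-chunk attention over pre-projected q, k, v of shape
    (batch, seq_len, dim), with seq_len divisible by chunk_size."""
    B, L, D = q.shape
    N = L // chunk_size
    # Chunk-level views: (batch, num_chunks, chunk_size, dim).
    q = q.view(B, N, chunk_size, D)
    k = k.view(B, N, chunk_size, D)
    v = v.view(B, N, chunk_size, D)
    # Prepend each chunk's predecessor; the first chunk gets zero-padding.
    k_prev = torch.cat([torch.zeros_like(k[:, :1]), k[:, :-1]], dim=1)
    v_prev = torch.cat([torch.zeros_like(v[:, :1]), v[:, :-1]], dim=1)
    k_win = torch.cat([k_prev, k], dim=2)  # (B, N, 2*chunk_size, D)
    v_win = torch.cat([v_prev, v], dim=2)
    # One batched (C x 2C) matmul per chunk: contiguous, accelerator-friendly.
    scores = q @ k_win.transpose(-1, -2) / math.sqrt(D)
    # Keep probability mass off the zero-padded keys of the first chunk.
    scores[:, 0, :, :chunk_size] = float("-inf")
    out = F.softmax(scores, dim=-1) @ v_win  # (B, N, chunk_size, D)
    return out.reshape(B, L, D)
```

Every score matrix here is a $C \times 2C$ block, so all matrix multiplications stay chunk-local while boundary tokens still attend to the preceding chunk.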
2. Chunk Construction, Receptive Field, and Windowing
Chunking divides the input into contiguous, non-overlapping windows of length $C$. SCA introduces a one-chunk overlap in the attention window: each query in chunk $i$ attends to all tokens in chunks $i-1$ and $i$. This design ensures that tokens at chunk boundaries retain contextual access, a feature absent from naive chunk-wise attention and associated with "sawtooth" boundary artifacts in loss curves. In SCA, matrix multiplications are performed at the chunk level, supporting batched, contiguous, and accelerator-friendly operations. The sliding property minimizes context discontinuities and supports intra-sequence information flow across chunk boundaries (Ma et al., 10 Jan 2026).
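The window pattern can equivalently be expressed as a boolean attention mask. The illustrative helper below (the function name and tiny sizes are assumptions for demonstration) shows that a token just past a chunk boundary still sees the full preceding chunk:

```python
import torch

def sliding_chunk_mask(seq_len, chunk_size):
    """mask[q, k] is True iff key position k falls in the query's own chunk
    or the immediately preceding chunk."""
    chunk_id = torch.arange(seq_len) // chunk_size
    diff = chunk_id[:, None] - chunk_id[None, :]  # query chunk minus key chunk
    return (diff == 0) | (diff == 1)

mask = sliding_chunk_mask(seq_len=8, chunk_size=2)
print(mask.int())
# Position 2 (first token of chunk 1) attends to positions 0-3: the boundary
# token keeps its left context, unlike non-overlapping chunk-wise attention.
```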
3. Computational Complexity and Hardware Efficiency
Let $L$ be the sequence length, $C$ the chunk size, and $N = L/C$ the number of chunks. The per-chunk operations require $O(C^2 d)$ work: (i) $Q_i [K_{i-1}; K_i]^\top$ with cost $O(2C^2 d)$, and (ii) the weighted sum with $[V_{i-1}; V_i]$ with cost $O(2C^2 d)$. Over all $N$ chunks, the total cost is $O(LCd)$. Memory cost is $O(Ld)$ for token projections and $O(LC)$ for chunk-local attention matrices. This contrasts with the $O(L^2 d)$ time and $O(L^2)$ memory of full self-attention. Unlike Longformer-style sliding windows that require separate per-token computations, SCA's per-chunk batched operations are amenable to high-performance parallelization (Ma et al., 10 Jan 2026).
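As a back-of-the-envelope check of these bounds, the following snippet compares attention-score time and memory; only $L = 4$M reflects the Gecko training setting reported below, while the chunk size and head dimension are illustrative assumptions:

```python
L, C, d = 4_000_000, 4_096, 128  # seq length; chunk size and head dim illustrative

full_time = L * L * d      # O(L^2 d): full self-attention score computation
sca_time  = L * 2 * C * d  # O(L C d): each of L queries scores against 2C keys
full_mem  = L * L          # O(L^2): dense attention matrix
sca_mem   = L * 2 * C      # O(L C): chunk-local (C x 2C) score blocks

print(f"time:   full / SCA = {full_time / sca_time:.0f}x")  # ~488x
print(f"memory: full / SCA = {full_mem / sca_mem:.0f}x")    # ~488x
```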
4. Comparisons with Full and Sparse Attention Mechanisms
SCA achieves an intermediate regime between full self-attention and (non-overlapping) chunk-wise attention. Full self-attention offers unlimited receptive fields but quadratically scaling cost; chunk-wise attention is efficient but introduces severe boundary effects. SCA inherits the $O(LCd)$ scaling of chunk-wise attention but alleviates context loss by overlapping chunk windows. Longformer-style sliding window attention scales linearly in window size but is inefficient on GPUs/TPUs due to per-token attention kernel launches. SCA's design, in contrast, mitigates these inefficiencies while preserving global receptive field growth with depth. Empirically, SCA eliminates negative log-likelihood spikes observed at chunk boundaries in standard chunked attention (Ma et al., 10 Jan 2026).
| Attention Mechanism | Complexity | GPU Efficiency | Boundary Effects |
|---|---|---|---|
| Full (Global) | $O(L^2 d)$ | Poor | None |
| Chunk-wise | $O(LCd)$ | High | Severe |
| Sliding Window | $O(LWd)$ | Low | Moderate |
| Sliding Chunk (SCA) | $O(LCd)$ | High | Minimal |

Here $L$ is the sequence length, $C$ the chunk size, and $W$ the per-token sliding-window size.
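One way to see the receptive-field contrast in the table: with a one-chunk overlap, every additional layer extends a token's backward reach by one chunk, whereas non-overlapping chunk-wise attention never escapes its own chunk. A tiny sketch, with the function name and values assumed for illustration:

```python
def backward_receptive_field(num_layers, chunk_size, overlap_chunks=1):
    """Upper bound on how far back (in tokens) a token can gather information
    after stacking layers: depth k covers 1 + k * overlap_chunks chunks.
    overlap_chunks=0 models plain non-overlapping chunk-wise attention."""
    return (1 + num_layers * overlap_chunks) * chunk_size

print(backward_receptive_field(32, 4096))     # SCA, 32 layers -> 135168 tokens
print(backward_receptive_field(32, 4096, 0))  # chunk-wise     ->   4096 tokens
```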
5. Empirical Performance and Effects in Sequence Modeling
In large-scale pretraining (Gecko), SCA enables training on 4 million token sequences without context-extension techniques. As context length grows, perplexity continues to decrease, in contrast to the flattening observed in alternative architectures (e.g., Megalodon). Gecko with SCA achieves a training loss of 1.68, outperforming Llama2-7B (1.75) and Megalodon-7B (1.70) at a comparable parameter and token budget, and closely matching Llama2-13B (1.67). In long-context information retrieval, Gecko robustly retrieves key items from contexts substantially longer than its nominal attention window, achieving 100% passkey retrieval accuracy on needle-in-a-haystack tasks at context lengths well beyond that window (Ma et al., 10 Jan 2026).
6. Extensions to Other Modalities and Variants
SCA principles have been instantiated in speech-to-text streaming models and visual text recognition:
- Chunk-wise Attention Transducers (CHAT): process audio in fixed-size chunks, employing local cross-attention within each chunk. The joiner performs cross-attention over each chunk and an appended blank frame, reducing peak training memory by up to 46.2%, accelerating training and inference (1.36× and 1.69×, respectively), and yielding significant reductions in word error rate (up to 6.3%) as well as BLEU improvements for streaming speech translation (up to 18.0%) (Xu et al., 27 Feb 2026).
- Sliding Convolutional Attention Network (SCAN): In scene text recognition, sliding windows extract overlapping patches, which are processed by CNNs and 1D convolutional encoders. At each output position, attention is applied over the windowed features, analogous to SCA in the spatial domain—a design that supports full parallelism and interpretable attention distributions (Wu et al., 2018).
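To ground the spatial analogy, overlapping patches of the kind SCAN slides over can be extracted with a single `unfold` call. The window width and stride below are made-up values, and this is a sketch of the sliding-window idea rather than the SCAN pipeline itself:

```python
import torch

# Toy text-line "image": (batch, channels, height, width).
x = torch.randn(1, 3, 32, 128)

window, stride = 32, 8  # illustrative sliding-window width and step
patches = x.unfold(dimension=3, size=window, step=stride)
print(patches.shape)    # torch.Size([1, 3, 32, 13, 32]): 13 overlapping patches
# Each patch would then pass through a CNN encoder before per-position attention.
```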
7. Limitations, Trade-offs, and Future Directions
SCA trades limited global context per layer for substantial efficiency and parallelization. The one-chunk overlap offers near-local attention while alleviating sharp boundary losses, but global information integration still depends on stack depth. Chunk size selection exposes an accuracy-latency tradeoff: large chunks provide broader context with increased per-chunk memory and potential latency. SCA is best suited for regimes where hardware efficiency and long-sequence handling are critical, and can be further enhanced through stacking with global attention layers or hybrid memory components (as in LMs with adaptive working memory) (Ma et al., 10 Jan 2026, Xu et al., 27 Feb 2026).
For implementation details, pseudocode, and the precise core equations, as well as all empirical results cited, refer directly to the Gecko LLM repository and documentation (Ma et al., 10 Jan 2026).