ScanWeaver: Compiler-Driven Parallelization of Affine Recurrences via Associative Scan Lowering

Published 30 May 2026 in cs.PL and cs.DC | (2606.00601v1)

Abstract: Selective state-space models such as Mamba highlight the practical importance of input-dependent scan recurrences, which preserve linear-time sequence modeling while improving language modeling capabilities. However, these recurrences introduce stricter sequential dependencies than classical structured SSMs, limiting parallel execution on modern accelerators. We present \textbf{ScanWeaver}, a compiler framework that transforms recurrence-based computations into associative scan representations and lowers them end-to-end to executable GPU programs. We use Mamba-style selective scan as a motivating example of a broader class of affine recurrences that arise in modern ML workloads. Rather than targeting a single model family, ScanWeaver elevates this recurrence structure to a first-class compiler abstraction, enabling systematic MLIR-based lowering to compiler-generated Blelloch scan execution on GPUs. Across forward selective-scan workloads with matched local recurrence semantics, we validate affine recurrence decomposition, Blelloch lowering, MLIR GPU lowering, executable artifact generation, and actual GPU execution from generated MLIR. We benchmark the resulting ScanWeaver GPU execution against PyTorch and CUDA sequential baselines, and include the Mamba kernel as a fused production baseline for systems context.

Abstract PDF Upgrade to Chat

Authors (2)

Summary

The paper introduces a compiler-driven affine scan lowering technique that converts sequential affine recurrences into parallel Blelloch scan schedules.
It leverages MLIR for systematic IR transformation and demonstrates latency reduction from 51.7 ms to 0.032 ms at a sequence length of 1024.
The approach generalizes to diverse recurrence-based computations while highlighting numerical stability challenges with aggressive exponential parameterizations.

ScanWeaver: Compiler-Driven Parallelization of Affine Recurrences via Associative Scan Lowering

Motivation and Affine Scan Abstraction

Sequential recurrences, pervasive in state-space models (SSMs), dynamic programming, and recurrent neural networks (RNNs), significantly impede parallel execution on modern accelerators due to their inherent step-wise dependencies. Recent advances like Mamba foreground input-dependent selective scan recurrences ( $h_t = a_t h_{t-1} + b_t x_t$ ), retaining linear-time traversal but imposing strict sequential constraints.

ScanWeaver addresses this by reframing recurrence computation as an associative affine scan, which decomposes sequential recurrences into compositions of affine transition pairs: $(A_t, U_t) = (a_t, b_t x_t)$ . Associative composition, $(A_2,U_2)\oplus(A_1,U_1) = (A_2A_1, A_2 U_1 + U_2)$ , facilitates prefix scan evaluation and exposes parallel structure amenable to scan algorithms.

(Figure 1)

Figure 1: Affine scan reformulates sequential recurrence traversal into a parallel prefix schedule over sequence dimension $L$ .

This conversion enables systematic compiler-driven lowering, turning what is nominally $O(L)$ dependency depth into $O(\log L)$ via Blelloch scan schedules, radically increasing parallel execution throughput on GPU architectures.

MLIR-Based Lowering Pipeline

ScanWeaver leverages MLIR's flexible IR stacking to instantiate affine scan as a compiler abstraction, supporting decomposition, scan lowering, and explicit execution mapping. The pipeline progresses from high-level recurrence IR through:

Affine normalization
Associative scan decomposition
Explicit Blelloch scan schedule materialization (upsweep, root init, downsweep)
MLIR GPU lowering
Artifact generation and execution on CUDA GPU

The affine scan abstraction decouples recurrence logic from realization schedules, enabling principled reuse across operator classes and generalization beyond selective scan, such as weighted prefix recurrences.

Implementation and Execution Strategies

ScanWeaver supports multiple execution paths:

PyTorch sequential reference
Naive CUDA recurrence traversal
Compiler-generated MLIR GPU Blelloch scan
Parallel CUDA Blelloch scan backend

By staging affine transition pairs in shared memory and mapping sequence dimension $L$ across scan trees, ScanWeaver achieves substantial reduction in sequential bottlenecks. The Blelloch scan reflects a logarithmic dependency structure, supplanting strictly sequential traversal with hierarchical parallel composition.

Numerical Results and Latency Scaling

ScanWeaver’s compiled Blelloch scan backend demonstrates markedly superior latency profile compared to both PyTorch sequential reference and naive CUDA implementations across increasing sequence lengths. At $L = 1024$ , measured latency is $0.032$ ms for ScanWeaver Blelloch, in stark contrast to $0.469$ ms (naive CUDA) and $(A_t, U_t) = (a_t, b_t x_t)$ 0 ms (PyTorch).

Figure 2: Latency scaling across sequence lengths; ScanWeaver Blelloch backend exhibits flatter scaling than sequential traversal due to logarithmic-depth parallel prefix scheduling.

This demonstrates concrete improvement in realized parallelism, with overhead attributable primarily to GPU synchronization and launch costs rather than arithmetic computation.

Generalization, Numerical Stability, and Systems Implications

ScanWeaver’s affine scan transformation is not limited to Mamba-like selective scan but generalizes to any affine recurrence that admits prefix composition. This abstraction enables compiler systematically to recover parallel execution for a broad set of recurrent ML workloads.

Implementation correctness holds across tested ranges of parameters, but aggressive exponential parameterization induces numerical instability—magnifying floating-point discrepancies and warranting separate characterization.

The compiler-centric approach of ScanWeaver challenges the necessity of bespoke kernel engineering for every operator, offering the potential for unified recurrence parallelization in future ML stacks, with further extensibility toward alternative scan schedules and distributed execution.

Comparison with Production Kernels and Limitations

While operator-specialized kernels (e.g., Mamba’s fused CUDA kernel) deliver high performance via hand-tuned memory access and fusion, ScanWeaver focuses on generalized compiler lowering and structural parallelization. The primary aim is not strict outperformance but rather to demonstrate that compilation-driven affine scan abstraction is viable and portable.

Notable limitations include instability under exponential parameters, restriction to Blelloch schedules, exclusion of gating/reset mechanisms, and absence of distributed/multi-GPU execution strategies.

Conclusion

ScanWeaver elevates affine scan to a compiler primitive, enabling systematic transformation and parallelization of recurrence-based computations via associative scan lowering and MLIR-driven GPU execution. The numerical results validate substantial reductions in latency, especially for long sequence traversals. This lays groundwork for universal recurrence parallelization schemes in ML compiler stacks, with implications for advancing both performance portability and structured IR lowering in accelerator-centric workloads.

Markdown Report Issue