- The paper introduces Priming, a framework that efficiently converts pre-trained transformers into hybrid state space models through strategic layer selection and weight mapping.
- It demonstrates that GKA-based hybrids markedly improve long-context reasoning and computational throughput, achieving up to 2.3× faster decode speeds.
- The study validates theoretical insights from realization theory and provides practical tools for scalable model deployment in real-world, long-context applications.
Introduction
This work presents a systematic framework, Priming, for constructing Hybrid State Space Models (Hybrid SSMs) by leveraging the weights and architecture of pre-trained Transformers. Rather than relying on resource-intensive pre-training from random initialization, the Priming approach enables efficient creation and scaling of hybrids that combine attention-based and state-space-based memory within a single model. This method is theoretically grounded in realization theory, addresses key limitations in the scalability and design space exploration of hybrid architectures, and provides practical algorithms and open tools for both training and serving such models. Priming yields state-of-the-art results on long-context and reasoning benchmarks, with improved computational efficiency over conventional transformers.
The Priming Framework: Architecture and Algorithm
Hybrid SSMs and the Fading-Eidetic Tradeoff
Hybrid models interleave attention layers (eidetic memory, requiring linear memory in sequence length for KV caching) and SSM layers (fading memory, compressing sequence history into a bounded-size latent state, independent of context length). The architectural goal is to balance retrieval fidelity (precise recall for local information) and scalability (ability to handle extremely long sequences with controlled resource consumption), by distributing eidetic and fading memory submodules judiciously across the network's depth.
Prior research required full pre-training for each new hybrid configuration, restricting large-scale comparisons and limiting architectural exploration. Priming circumvents this by recasting hybridization as a knowledge transfer problem:
- Layer Selection: Identify the transformer attention layers most amenable to SSM replacement. The Hankel rank of each layer's mixing matrix is not measured directly; instead, the performance degradation observed when a layer is substituted with sliding window attention (SWA) serves as an empirical proxy for it.
- Initialization: Map corresponding transformer weights to initialize the SSM layers, based on the matrix-mixer correspondence between attention and state-space mechanisms.
- Alignment (Stage 1): Use end-to-end mean squared error objectives to align the SSMs' outputs to the original attention layers over a broad data distribution.
- Task Adaptation (Stage 2): Fine-tune the hybrid on downstream tasks of interest, e.g., long-context next-token prediction or instruction tuning.
This procedure requires less than 0.5% of the token budget typically used to pre-train a transformer, which avoids end-to-end retraining for each new hybrid configuration.
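The layer-selection step can be sketched as a simple ranking. This is a minimal illustration, not the authors' code: the function name, the per-layer degradation scores, and the choice to replace a fixed number of layers are all assumptions made for the example. Layers that lose the least when swapped for SWA are treated as low-rank and replaced; the rest stay as attention.

```python
from typing import Dict, List, Tuple


def select_layers_for_replacement(
    swa_degradation: Dict[int, float], num_to_replace: int
) -> Tuple[List[int], List[int]]:
    """Split layer indices into (replace_with_ssm, keep_as_attention).

    swa_degradation maps a layer index to the quality drop measured when that
    layer alone is swapped for sliding window attention (hypothetical metric).
    """
    if not 0 <= num_to_replace <= len(swa_degradation):
        raise ValueError("num_to_replace must be between 0 and the layer count")
    # Ascending order: layers that barely suffer under SWA come first.
    ranked = sorted(swa_degradation, key=lambda layer: (swa_degradation[layer], layer))
    replace = sorted(ranked[:num_to_replace])
    keep = sorted(ranked[num_to_replace:])
    return replace, keep


# Hypothetical scores for a 6-layer model; not taken from the paper.
scores = {0: 0.02, 1: 0.31, 2: 0.05, 3: 0.44, 4: 0.01, 5: 0.27}
replace, keep = select_layers_for_replacement(scores, num_to_replace=3)
print(replace, keep)  # [0, 2, 4] [1, 3, 5]
```

In this toy run, layers 0, 2, and 4 would be initialized as SSM layers from the attention weights, while layers 1, 3, and 5 remain attention.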
SSM Layer Families and Expressiveness
Priming enables controlled comparison of different SSM sublayer variants:
- Mamba-2: Uniform, input-dependent scalar decay for fading memory. Simplest and least expressive.
- Gated DeltaNet (GDN): Adds key-specific erasure (rank-1 update), enabling selective forgetting and more precise state updates.
- Gated KalmaNet (GKA): Further generalizes erasure direction to be history-dependent via an online ridge-regression perspective (Kalman filter variant), increasing expressiveness and offering a runtime knob for performance-latency tradeoffs.
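The difference between the first two families can be made concrete with single-step state updates. The sketch below uses the commonly published forms of a scalar-decay update and a gated delta-rule update; the function names, the state layout (value dimension by key dimension), and the key normalization are simplifying assumptions, and the GKA update is deliberately not reproduced here.

```python
import numpy as np


def scalar_decay_step(S, k, v, a):
    """Mamba-2-style update: decay the whole state by scalar a, then write v k^T."""
    return a * S + np.outer(v, k)


def gated_delta_step(S, k, v, a, beta):
    """Gated-delta-style update: decay, erase what is stored along k, then write.

    Equivalent to S <- a * S (I - beta k k^T) + beta v k^T with a unit-norm key.
    """
    k = k / np.linalg.norm(k)          # assumption: keys are normalized
    S = a * S
    S = S - beta * np.outer(S @ k, k)  # rank-1, key-specific erasure
    return S + beta * np.outer(v, k)


# Write two different values under the same key and read the key back.
k = np.array([1.0, 0.0, 0.0])
v1 = np.array([1.0, 2.0])
v2 = np.array([-3.0, 5.0])

S_decay = np.zeros((2, 3))
S_delta = np.zeros((2, 3))
for v in (v1, v2):
    S_decay = scalar_decay_step(S_decay, k, v, a=1.0)
    S_delta = gated_delta_step(S_delta, k, v, a=1.0, beta=1.0)

read_decay = S_decay @ k  # accumulates both writes
read_delta = S_delta @ k  # second write overwrote the first
print(read_decay, read_delta)
```

With no decay and full erasure strength, the scalar-decay state returns the sum of both values, while the delta-rule state returns only the most recent one, which is the selective forgetting described above.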
Models using GKA outperformed GDN and Mamba-2 systematically on reasoning and long-context tasks, in line with their theoretical expressiveness hierarchy. Notably, GKA exposes the number of Chebyshev iterations as a runtime parameter, enabling practitioners to adjust throughput-vs-accuracy post hoc.
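To see why an iteration count can act as a runtime accuracy knob, consider a generic Chebyshev iteration applied to a small ridge-regression system. This is a textbook solver shown under stated assumptions, not the GKA kernel: the matrix, the regularizer value 0.5, and the use of exact eigenvalue bounds are all choices made for the illustration.

```python
import numpy as np


def chebyshev_solve(A, b, lam_min, lam_max, num_iters):
    """Approximate the solution of A x = b for symmetric positive definite A.

    lam_min and lam_max bound the spectrum of A (lam_min < lam_max). More
    iterations give a more accurate x at proportionally higher cost.
    """
    theta = 0.5 * (lam_max + lam_min)  # center of the spectrum
    delta = 0.5 * (lam_max - lam_min)  # half-width of the spectrum
    sigma = theta / delta
    rho = 1.0 / sigma
    x = np.zeros_like(b)
    r = b.copy()                       # residual for the zero initial guess
    d = r / theta
    for _ in range(num_iters):
        x = x + d
        r = r - A @ d
        rho_next = 1.0 / (2.0 * sigma - rho)
        d = rho_next * rho * d + (2.0 * rho_next / delta) * r
        rho = rho_next
    return x


# Illustrative ridge system (K^T K / n + lambda I) x = q with random data.
rng = np.random.default_rng(0)
K = rng.standard_normal((64, 8))
A = K.T @ K / 64 + 0.5 * np.eye(8)
q = rng.standard_normal(8)

eigs = np.linalg.eigvalsh(A)           # exact bounds, for the demo only
exact = np.linalg.solve(A, q)
errors = {
    n: float(np.linalg.norm(chebyshev_solve(A, q, eigs[0], eigs[-1], n) - exact))
    for n in (2, 8, 32)
}
print(errors)
```

The error shrinks as the iteration count grows, so a serving system could trade a few iterations for lower latency without touching the weights.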
Advancements in Long-Context and Scaling
Hybrid models are constructed and evaluated at both 8B and 32B parameter scales, with native sequence lengths up to 128K tokens, extended to 256K+ using state composition without retraining. Importantly:
- Primed Hybrids match or exceed Transformer quality on a suite of long-context and reasoning benchmarks, with up to 2.3× higher decode throughput and ~2× more concurrent inference capacity at fixed hardware/latency.
- On AIME 2025 and related hard reasoning sets, speedups in wall-clock "time-to-target-accuracy" reach up to 1.6× compared to identically post-trained transformers, a critical property when large-scale RL or parallel sampling workloads are present.
- The authors introduce fused architectures to minimize memory during stage-1 alignment, optimized tiling and Triton kernels for GKA SSMs, and sequence-parallel (P2P and universal) algorithms for long-context training.
Empirical Results and Claims
Head-to-Head SSM Comparison Under Control
- When attention layers are replaced according to an importance ranking, SSM replacements outperform sliding window attention replacements: compressed fading memory proves more valuable, even though the sliding window variant uses a significantly larger KV cache.
- Performance deltas: GKA-based hybrids achieve +3.8 pp over source Qwen3-32B on reasoning benchmarks, staying within 1% of the same transformer post-trained on the same recipe.
- SSM-based hybrids outperform windowed attention-based models by ≥5.5% (relative) on long-context benchmarks, reinforcing the superiority of compressed summary (fading memory) over verbatim sliding window approaches.
Throughput and Scalability
- Inference-time memory usage is nearly halved when half the attention layers are replaced by SSMs, directly improving concurrency and throughput. For GKA-based hybrids, decode throughput is up to 2.3× that of the baseline transformer at 128K context.
- B'MOJO-F layers (SSM+SWA) are strictly more expressive than pure SSMs but incur more compute, with their computational benefit arising only at longer contexts unless further kernel-level fusion is applied.
- Test-time scaling: The GKA Chebyshev iteration count can be set dynamically at inference, providing an explicit intelligence-per-FLOP tradeoff, which is an uncommon feature in LLM architectures.
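The memory claim follows from how the two cache types scale. The arithmetic below is a back-of-the-envelope sketch; the layer count, head counts, head dimensions, and 2-byte element size are hypothetical values chosen for illustration, not the configuration of any model in the paper.

```python
def kv_cache_bytes(num_attn_layers, seq_len, num_kv_heads, head_dim, bytes_per_elem=2):
    """Keys plus values for every cached token in every attention layer."""
    return 2 * num_attn_layers * seq_len * num_kv_heads * head_dim * bytes_per_elem


def ssm_state_bytes(num_ssm_layers, num_heads, key_dim, value_dim, bytes_per_elem=2):
    """One fixed-size state matrix per head, independent of sequence length."""
    return num_ssm_layers * num_heads * key_dim * value_dim * bytes_per_elem


# Hypothetical configuration: 64 layers, 8 KV heads of size 128, 128K context.
seq_len = 128 * 1024
transformer = kv_cache_bytes(64, seq_len, num_kv_heads=8, head_dim=128)

# Hybrid with half the attention layers replaced by SSM layers.
hybrid = kv_cache_bytes(32, seq_len, num_kv_heads=8, head_dim=128) + ssm_state_bytes(
    32, num_heads=32, key_dim=128, value_dim=128
)

ratio = hybrid / transformer
print(f"transformer: {transformer / 2**30:.2f} GiB")
print(f"hybrid:      {hybrid / 2**30:.2f} GiB")
print(f"ratio:       {ratio:.4f}")
```

Under these assumptions the SSM states add only a small constant on top of the halved KV cache, which is why the per-sequence footprint lands just above 50% and more sequences fit on the same hardware.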
Training-Free Context Extension
- State composition techniques extend the context window to 2× or 4× the model’s trained limit (e.g., from 128K→256K→512K), exploiting the additive/multiplicative decomposability properties of different SSM families. For GKA, additive composition is effective for sufficiently long chunks.
- At 512K context, hybrids retain meaningful performance, outperforming attention-scaling heuristics (e.g., YaRN) alone, although further improvements in merging strategies are highlighted as future work.
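The decomposability that state composition relies on can be shown for the simplest case, a scalar-decay linear recurrence. This sketch is an assumption-laden illustration rather than the paper's merging procedure: the function names are invented, and it covers only the exact identity for scalar decay, not the approximate additive composition reported for GKA.

```python
import numpy as np


def run_chunk(keys, values, decays, S0=None):
    """Run S <- a_t * S + v_t k_t^T over a chunk.

    Returns the final state and the product of all decays in the chunk.
    """
    S = np.zeros((values.shape[1], keys.shape[1])) if S0 is None else S0.copy()
    total_decay = 1.0
    for k, v, a in zip(keys, values, decays):
        S = a * S + np.outer(v, k)
        total_decay *= a
    return S, total_decay


def compose(S_first, S_second, decay_second):
    """Merge the states of two consecutive chunks that were processed independently."""
    return decay_second * S_first + S_second


rng = np.random.default_rng(0)
T, d_k, d_v = 16, 4, 3
keys = rng.standard_normal((T, d_k))
values = rng.standard_normal((T, d_v))
decays = rng.uniform(0.8, 1.0, size=T)

# Reference: one sequential pass over the full sequence.
S_full, _ = run_chunk(keys, values, decays)

# Two halves processed separately from a zero state, then composed.
S_a, _ = run_chunk(keys[:8], values[:8], decays[:8])
S_b, decay_b = run_chunk(keys[8:], values[8:], decays[8:])
S_composed = compose(S_a, S_b, decay_b)

print(np.allclose(S_full, S_composed))
```

When every decay equals 1 the composition reduces to a plain sum of chunk states, which corresponds to the purely additive case.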
Theoretical Implications
Realization Theory Foundation
The Priming procedure is underpinned by a realization-theoretic analysis:
- Any learned attention head computes a causal sequence-mixing (lower-triangular) matrix; SSMs can realize such a mixing operation exactly if the Hankel rank is less than or equal to the SSM state size.
- Layer selection using importance scoring (via degradations under SWA) acts as a proxy for ranking layers by Hankel rank, retaining those with high rank in attention form and replacing low-rank layers with SSMs.
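The Hankel-rank condition can be checked numerically on small matrices. The sketch below assumes the usual definition for a causal mixing matrix, namely the largest rank among the blocks that map past inputs to future outputs; the helper names, the time-invariant two-state SSM, and the random attention inputs are illustrative choices, not the paper's analysis.

```python
import numpy as np


def hankel_rank(M):
    """Largest rank over the blocks M[t:, :t] of a causal (lower-triangular) mixer."""
    T = M.shape[0]
    return max(int(np.linalg.matrix_rank(M[t:, :t])) for t in range(1, T))


def ssm_mixing_matrix(A, B, C, T):
    """Mixing matrix of a time-invariant SSM: M[i, j] = C A^(i-j) B for i >= j."""
    M = np.zeros((T, T))
    for i in range(T):
        for j in range(i + 1):
            M[i, j] = C @ np.linalg.matrix_power(A, i - j) @ B
    return M


def causal_softmax_attention(Q, K):
    """Row-normalized causal attention weights for a single head."""
    logits = Q @ K.T / np.sqrt(Q.shape[1])
    mask = np.tril(np.ones_like(logits, dtype=bool))
    logits = np.where(mask, logits, -np.inf)
    logits = logits - logits.max(axis=1, keepdims=True)
    W = np.exp(logits)
    return W / W.sum(axis=1, keepdims=True)


T, state_size = 24, 2
A = np.diag([0.9, 0.5])
B = np.array([1.0, 1.0])
C = np.array([1.0, -0.5])
rank_ssm = hankel_rank(ssm_mixing_matrix(A, B, C, T))

rng = np.random.default_rng(0)
Q = rng.standard_normal((T, 8))
K = rng.standard_normal((T, 8))
rank_attn = hankel_rank(causal_softmax_attention(Q, K))

print(rank_ssm, rank_attn)
```

The SSM mixer never exceeds its state size, whereas the random softmax attention head has a much higher Hankel rank, so a two-state SSM could not realize it exactly. Layers whose learned attention behaves more like the first case are the natural candidates for replacement.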
Functional Specialization and Layer Heterogeneity
Results confirm that transformer layers develop specialized circuits, with certain middle-to-late layers being critical for long-range recall and others being compressible. This functional heterogeneity enables selective SSM hybridization without catastrophic performance loss.
Practical Implications
- Open-source release: Models (8B/32B, long-context, reasoning, instruction-following), codebase (training/inference, sequence parallelism), and inference/serving plugins (vLLM) are openly available.
- Compatibility with existing infrastructure: The technique integrates directly with production serving engines and existing transformer model families (Qwen3, Llama, Mistral, etc.), facilitating adoption.
Limitations and Future Directions
- Scaling above 32B parameters and to >75% SSM layer ratios is not yet systematically characterized.
- Open directions include context extension of 4× or more, improved chunk-composition methods, per-layer dynamic test-time compute, and the benefit of hybrid architectures in RL post-training loops.
- The effects of quantization and further kernel optimizations on the efficiency gap relative to transformers are left for future work.
Conclusion
Priming represents a significant advance in efficient, modular, and theoretically grounded cross-architecture transfer for Hybrid State Space Models. It allows the systematic study and deployment of SSM-based hybrids at scale, with measured efficiency gains and competitive downstream performance. The approach is distinguished by its tight coupling of realization theory, controlled empirical comparison, and open-source tooling, and it points toward LLM architectures that can handle long-context and agentic reasoning under resource constraints.