Priming: Hybrid State Space Models From Pre-trained Transformers

Published 8 May 2026 in cs.LG and cs.AI | (2605.08301v1)

Abstract: Hybrid State-Space models combine Attention with recurrent State-Space Model (SSM) layers, balancing eidetic memory from Attention with compressed fading memory from SSMs. This yields smaller Key-Value caches and faster decoding than Transformers, along with a richer architectural design space. Exploring that design space at scale has so far required training from scratch, a barrier that has kept most large-model Hybrid research within a narrow range of architectures. We introduce Priming, a method that turns Hybrid architecture design from a pre-training problem into a knowledge transfer one. Priming initializes a Hybrid model from a pre-trained Transformer and, through short alignment and post-training phases, recovers downstream quality using less than 0.5% of the source model's pre-training token budget. Priming is agnostic to the source Transformer family (e.g., Qwen, Llama, Mistral), model class (dense or Mixture-of-Experts), and model scale. Priming enables us to run the first controlled comparison of SSM layer types at scale under identical conditions. We evaluate Gated KalmaNet (GKA), Gated DeltaNet (GDN), and Mamba-2, and show that their expressiveness hierarchy, GKA>GDN>Mamba-2, directly predicts downstream performance on long-context reasoning tasks. We scale Priming to 8B/32B reasoning models with native 128K contexts. Our Hybrid GKA 32B improves over its source Qwen3-32B by +3.8 average reasoning points, while staying within 1% of a Transformer post-trained on the same data and enabling up to 2.3x higher decode throughput. To foster research on Hybrid architectures, we release a model zoo of primed Hybrid models for long-context reasoning and instruction following, together with the Priming training and inference code (Sequence Parallelism algorithms for long-context training, optimized GKA kernels, and vLLM serving plugin), all under Apache 2.0 License.

Authors (9)

Summary

The paper introduces Priming, a framework that efficiently converts pre-trained transformers into hybrid state space models through strategic layer selection and weight mapping.
It demonstrates that GKA-based hybrids markedly improve long-context reasoning and computational throughput, achieving up to 2.3× faster decode speeds.
The study validates theoretical insights from realization theory and provides practical tools for scalable model deployment in real-world, long-context applications.

Hybrid State Space Models via Priming: An Expert Analysis of "Priming: Hybrid State Space Models From Pre-trained Transformers" (2605.08301)

Introduction

This work presents Priming, a systematic framework for constructing Hybrid State Space Models (Hybrid SSMs) by reusing the weights and architecture of pre-trained Transformers. Rather than relying on resource-intensive pre-training from random initialization, Priming enables efficient creation and scaling of hybrids that combine attention-based and state-space-based memory within a single model. The method is grounded in realization theory, addresses key limitations in the scalability and design-space exploration of hybrid architectures, and provides practical algorithms and open tools for both training and serving such models. Primed hybrids match or closely track identically post-trained Transformers on long-context and reasoning benchmarks while decoding faster.

The Priming Framework: Architecture and Algorithm

Hybrid SSMs and the Fading-Eidetic Tradeoff

Hybrid models interleave attention layers (eidetic memory, requiring linear memory in sequence length for KV caching) and SSM layers (fading memory, compressing sequence history into a bounded-size latent state, independent of context length). The architectural goal is to balance retrieval fidelity (precise recall for local information) and scalability (ability to handle extremely long sequences with controlled resource consumption), by distributing eidetic and fading memory submodules judiciously across the network's depth.
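
The memory side of this tradeoff can be made concrete with a back-of-the-envelope calculation. The sketch below compares a KV cache that grows linearly with sequence length against a fixed-size SSM state. All layer counts, head counts, and dimensions are illustrative assumptions, not values reported in the paper, and `kv_cache_bytes` and `ssm_state_bytes` are hypothetical helper names.

```python
def kv_cache_bytes(seq_len, n_attn_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Keys and values for every cached token in every attention layer."""
    return 2 * n_attn_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def ssm_state_bytes(n_ssm_layers, n_heads, key_dim, value_dim, bytes_per_elem=2):
    """One (key_dim x value_dim) state matrix per head; independent of seq_len."""
    return n_ssm_layers * n_heads * key_dim * value_dim * bytes_per_elem


GIB = 2**30
seq_len = 128 * 1024  # 128K tokens

# Hypothetical dense configuration (assumed, not taken from the paper).
full_attention = kv_cache_bytes(seq_len, n_attn_layers=64, n_kv_heads=8, head_dim=128)

# Same model with half of the attention layers swapped for SSM layers.
hybrid = (
    kv_cache_bytes(seq_len, n_attn_layers=32, n_kv_heads=8, head_dim=128)
    + ssm_state_bytes(n_ssm_layers=32, n_heads=32, key_dim=128, value_dim=128)
)

print(f"full attention: {full_attention / GIB:.2f} GiB")
print(f"hybrid:         {hybrid / GIB:.2f} GiB")
```

Under these assumed sizes the SSM state adds only tens of MiB, so the per-sequence memory of the hybrid is dominated by the attention layers that remain.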

Prior research required full pre-training for each new hybrid configuration, restricting large-scale comparisons and limiting architectural exploration. Priming circumvents this by recasting hybridization as a knowledge transfer problem:

Layer Selection: Select the transformer attention layers most amenable to SSM replacement. Each layer is scored by how much performance degrades when its attention is swapped for sliding window attention (SWA), which serves as an empirical proxy for the Hankel rank of the layer's mixing matrix.
Initialization: Map corresponding transformer weights to initialize the SSM layers, based on the matrix-mixer correspondence between attention and state-space mechanisms.
Alignment (Stage 1): Use end-to-end mean squared error objectives to align the SSMs' outputs to the original attention layers over a broad data distribution.
Task Adaptation (Stage 2): Fine-tune the hybrid on downstream tasks of interest, e.g., long-context next-token prediction or instruction tuning.

This procedure uses less than 0.5% of the source model's pre-training token budget, whereas training each hybrid from scratch would require a full pre-training run.
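
The layer-selection step can be sketched as a simple ranking. The example below assumes each layer already has a scalar score measuring the benchmark drop when its attention is swapped for SWA. The scoring procedure, the function name `select_layers_for_ssm`, and the scores themselves are hypothetical and only illustrate the idea of replacing the least-affected layers first.

```python
def select_layers_for_ssm(swa_degradation, ssm_fraction):
    """Return (ssm_layers, attention_layers) as sorted lists of layer indices.

    swa_degradation[i] is an assumed benchmark drop observed when layer i's
    attention is swapped for sliding window attention. A small drop suggests
    the layer does not depend on exact long-range recall.
    """
    n_layers = len(swa_degradation)
    n_ssm = int(round(ssm_fraction * n_layers))
    # Least-affected layers come first in the ranking.
    ranked = sorted(range(n_layers), key=lambda i: swa_degradation[i])
    ssm_layers = sorted(ranked[:n_ssm])
    attention_layers = sorted(ranked[n_ssm:])
    return ssm_layers, attention_layers


# Made-up scores for an 8-layer model.
scores = [0.2, 0.1, 3.5, 0.4, 4.1, 0.3, 2.8, 0.05]
ssm_layers, attention_layers = select_layers_for_ssm(scores, ssm_fraction=0.5)
print("replace with SSM:", ssm_layers)
print("keep as attention:", attention_layers)
```

In this toy example the four layers with the smallest degradation become SSM layers, and the layers whose removal hurts most stay as attention.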

SSM Layer Families and Expressiveness

Priming enables controlled comparison of different SSM sublayer variants:

Mamba-2: Uniform, input-dependent scalar decay for fading memory. Simplest and least expressive.
Gated DeltaNet (GDN): Adds key-specific erasure (rank-1 update), enabling selective forgetting and more precise state updates.
Gated KalmaNet (GKA): Further generalizes erasure direction to be history-dependent via an online ridge-regression perspective (Kalman filter variant), increasing expressiveness and offering a runtime knob for performance-latency tradeoffs.

Models using GKA outperformed GDN and Mamba-2 systematically on reasoning and long-context tasks, in line with their theoretical expressiveness hierarchy. Notably, GKA exposes the number of Chebyshev iterations as a runtime parameter, enabling practitioners to adjust throughput-vs-accuracy post hoc.
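
The difference between uniform decay and key-specific erasure can be illustrated with the commonly cited forms of the two simpler updates. This is a minimal NumPy sketch, not the paper's kernels. It assumes a matrix-valued state `S` mapping keys to values, a scalar decay `alpha`, and a scalar write strength `beta`. The GKA update is omitted because its exact form is not given in this summary.

```python
import numpy as np


def scalar_decay_step(S, k, v, alpha):
    """Mamba-2-style update: decay the whole state, then add the new pair."""
    return alpha * S + np.outer(v, k)


def gated_delta_step(S, k, v, alpha, beta):
    """Gated DeltaNet-style update: decay, erase along k, then write v."""
    d_k = k.shape[0]
    erase = np.eye(d_k) - beta * np.outer(k, k)
    return alpha * S @ erase + beta * np.outer(v, k)


k = np.array([1.0, 0.0, 0.0])   # unit-norm key
v_old = np.array([1.0, 2.0])
v_new = np.array([5.0, -1.0])
S0 = np.zeros((2, 3))           # state maps keys (dim 3) to values (dim 2)

# Write v_old, then write v_new under the same key, with no decay.
S_add = scalar_decay_step(scalar_decay_step(S0, k, v_old, 1.0), k, v_new, 1.0)
S_delta = gated_delta_step(gated_delta_step(S0, k, v_old, 1.0, 1.0), k, v_new, 1.0, 1.0)

read_add = S_add @ k      # old and new values are superimposed
read_delta = S_delta @ k  # old value was erased before the new one was written
print(read_add, read_delta)
```

Reading back with the same key returns the sum of both values under the additive update, but only the most recent value under the delta-style update. This is the selective forgetting the list above attributes to GDN.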

Advancements in Long-Context and Scaling

Hybrid models are constructed and evaluated at both 8B and 32B parameter scales, with native sequence lengths up to 128K tokens, extended to 256K+ using state composition without retraining. Importantly:

Primed Hybrids match or exceed Transformer quality on a suite of long-context and reasoning benchmarks, with up to 2.3× higher decode throughput and ~2× more concurrent inference capacity at fixed hardware/latency.
On AIME 2025 and related hard reasoning sets, speedups in wall-clock "time-to-target-accuracy" reach up to 1.6× compared to identically post-trained transformers, a critical property when large-scale RL or parallel sampling workloads are present.
The authors introduce fused architectures to minimize memory during stage-1 alignment, optimized tiling and Triton kernels for GKA SSMs, and sequence-parallel (P2P and universal) algorithms for long-context training.

Empirical Results and Claims

Head-to-Head SSM Comparison Under Control

When attention layers are replaced with SSMs based on importance ranking, Hybrid models show that compressed fading memory is more valuable than sliding window attention, even with significantly larger KV cache size for the latter.
Performance deltas: the Hybrid GKA 32B model gains +3.8 average reasoning points over its source Qwen3-32B while staying within 1% of a Transformer post-trained on the same data.
SSM-based hybrids outperform windowed attention-based models by ≥5.5% (relative) on long-context benchmarks, reinforcing the superiority of compressed summary (fading memory) over verbatim sliding window approaches.

Throughput and Scalability

Memory: Inference-time memory usage is nearly halved when half the attention layers are replaced by SSMs, directly improving concurrency and throughput.
Decode throughput: For GKA-based hybrids, decode throughput is up to 2.3× that of the baseline transformer at 128K context.
B'MOJO-F layers (SSM+SWA): These are strictly more expressive than pure SSMs but incur more compute, with their computational benefit arising only at longer contexts unless further kernel-level fusion is applied.
Test-time scaling: The GKA Chebyshev iteration count can be set dynamically at inference, providing an explicit intelligence-per-FLOP tradeoff, an uncommon feature in LLM architectures.
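
To see why an iteration count can act as a compute-versus-accuracy knob, consider a generic Chebyshev iteration applied to a ridge-regularized linear system. The sketch below is a textbook solver, not the GKA kernel. The system being solved, the ridge strength, and the use of exact eigenvalue bounds are all assumptions made for illustration.

```python
import numpy as np


def chebyshev_solve(A, b, lam_min, lam_max, n_iters):
    """Approximate the solution of A x = b for symmetric positive definite A.

    lam_min and lam_max must bracket the eigenvalues of A.
    """
    d = 0.5 * (lam_max + lam_min)   # centre of the spectrum
    c = 0.5 * (lam_max - lam_min)   # half-width of the spectrum
    x = np.zeros_like(b)
    r = b - A @ x
    p = np.zeros_like(b)
    alpha = 0.0
    for i in range(n_iters):
        if i == 0:
            p = r.copy()
            alpha = 1.0 / d
        else:
            beta = 0.5 * (c * alpha) ** 2 if i == 1 else (0.5 * c * alpha) ** 2
            alpha = 1.0 / (d - beta / alpha)
            p = r + beta * p
        x = x + alpha * p
        r = b - A @ x
    return x


rng = np.random.default_rng(0)
K = rng.standard_normal((16, 8))   # hypothetical stack of past keys
q = rng.standard_normal(8)         # hypothetical query
ridge = 1.0                        # assumed regularization strength
A = K.T @ K + ridge * np.eye(8)    # ridge-regularized Gram matrix
eigs = np.linalg.eigvalsh(A)       # exact bounds, used here for illustration only
exact = np.linalg.solve(A, q)


def rel_error(n_iters):
    approx = chebyshev_solve(A, q, eigs[0], eigs[-1], n_iters)
    return np.linalg.norm(approx - exact) / np.linalg.norm(exact)


errors = {n: rel_error(n) for n in (2, 8, 64)}
for n, err in errors.items():
    print(f"{n:3d} iterations: relative error {err:.2e}")
```

Each extra iteration costs one more matrix-vector product and tightens the approximation, so the iteration count trades FLOPs for accuracy without changing any learned weights.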

Training-Free Context Extension

State composition techniques extend the context window to 2× or 4× the model’s trained limit (e.g., from 128K→256K→512K), exploiting the additive/multiplicative decomposability properties of different SSM families. For GKA, additive composition is effective for sufficiently long chunks.
At 512K context, hybrids retain meaningful performance, outperforming attention-scaling heuristics (e.g., YaRN) alone, although further improvements in merging strategies are highlighted as future work.
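
For the simplest SSM family, state composition can be written down exactly. The sketch below assumes a scalar-decay linear recurrence (Mamba-2 style) and shows that two chunks processed independently from a zero state can be merged by decaying the first chunk's state and adding the second. The composition rules the paper uses for GDN and GKA are not specified in this summary and are not reproduced here.

```python
import numpy as np


def run_scalar_decay(keys, values, decays):
    """Scan S_t = a_t * S_{t-1} + v_t k_t^T from a zero state.

    Returns the final state and the product of all decays in the chunk.
    """
    S = np.zeros((values.shape[1], keys.shape[1]))
    total_decay = 1.0
    for k, v, a in zip(keys, values, decays):
        S = a * S + np.outer(v, k)
        total_decay *= a
    return S, total_decay


rng = np.random.default_rng(0)
T, d_k, d_v = 12, 4, 3
keys = rng.standard_normal((T, d_k))
values = rng.standard_normal((T, d_v))
decays = rng.uniform(0.8, 1.0, size=T)

# Reference: one pass over the full sequence.
S_full, _ = run_scalar_decay(keys, values, decays)

# Two chunks processed independently, then merged.
half = T // 2
S_a, _ = run_scalar_decay(keys[:half], values[:half], decays[:half])
S_b, decay_b = run_scalar_decay(keys[half:], values[half:], decays[half:])
S_merged = decay_b * S_a + S_b

print(np.allclose(S_full, S_merged))
```

The merge is exact here because the recurrence is linear in the state. Families with data-dependent erasure need a different rule, which is consistent with the summary's note that merging strategies remain an open area.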

Theoretical Implications

Realization Theory Foundation

The Priming procedure is underpinned by a realization-theoretic analysis:

Any learned attention head computes a causal sequence-mixing (lower-triangular) matrix; SSMs can realize such a mixing operation exactly if the Hankel rank is less than or equal to the SSM state size.
Layer selection using importance scoring (via degradations under SWA) acts as a proxy for ranking layers by Hankel rank, retaining those with high rank in attention form and replacing low-rank layers with SSMs.
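
The Hankel-rank condition can be checked numerically on toy mixing matrices. The sketch below builds the causal mixing matrix of a small time-invariant SSM and of a random causal softmax attention pattern, then measures the rank of the block that maps past inputs to future outputs. The matrices, sizes, and the helper name `hankel_rank` are illustrative assumptions.

```python
import numpy as np


def ssm_mixing_matrix(A, b, c, T):
    """Causal mixing matrix M[t, s] = c^T A^(t-s) b for s <= t."""
    M = np.zeros((T, T))
    for t in range(T):
        for s in range(t + 1):
            M[t, s] = c @ np.linalg.matrix_power(A, t - s) @ b
    return M


def causal_softmax(scores):
    """Row-wise softmax restricted to the lower triangle."""
    T = scores.shape[0]
    masked = np.where(np.tril(np.ones((T, T), dtype=bool)), scores, -np.inf)
    weights = np.exp(masked - masked.max(axis=1, keepdims=True))
    return weights / weights.sum(axis=1, keepdims=True)


def hankel_rank(M, split, tol=1e-8):
    """Rank of the block mapping inputs before `split` to outputs from `split` on."""
    return int(np.linalg.matrix_rank(M[split:, :split], tol=tol))


T, n = 16, 2                      # sequence length and SSM state size
A = np.diag([0.9, 0.5])
b = np.array([1.0, 1.0])
c = np.array([1.0, -0.5])
M_ssm = ssm_mixing_matrix(A, b, c, T)

rng = np.random.default_rng(0)
M_attn = causal_softmax(rng.standard_normal((T, T)))

rank_ssm = hankel_rank(M_ssm, split=T // 2)
rank_attn = hankel_rank(M_attn, split=T // 2)
print("SSM Hankel rank:", rank_ssm, "| attention Hankel rank:", rank_attn)
```

The SSM block never exceeds the state size, while an unstructured attention pattern generally does. A layer whose learned mixing matrix has low Hankel rank is therefore a natural candidate for SSM replacement.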

Functional Specialization and Layer Heterogeneity

Results confirm that transformer layers develop specialized circuits, with certain middle-to-late layers being critical for long-range recall and others being compressible. This functional heterogeneity enables selective SSM hybridization without catastrophic performance loss.

Practical Implications

Open-source release: Models (8B/32B, long-context, reasoning, instruction-following), codebase (training/inference, sequence parallelism), and inference/serving plugins (vLLM) are openly available.
Existing infrastructure: The technique integrates directly with production serving engines and existing transformer model families (Qwen3, Llama, Mistral, etc.), facilitating adoption.

Limitations and Future Directions

Scaling above 32B parameters and to >75% SSM layer ratios is not yet systematically characterized.
Further gains at 4× or greater context extension, improved chunk-composition methods, per-layer dynamic test-time compute, and the benefit of hybrid architectures in RL post-training loops remain open.
The effects of quantization and further kernel optimizations on the efficiency gap relative to transformers are left for future work.

Conclusion

Priming represents a significant advance in efficient, modular, and theoretically grounded cross-architecture transfer for Hybrid State Space Models. It allows the systematic study and deployment of SSM-based hybrids at scale, with measured efficiency gains and competitive downstream performance. The approach is distinguished by its tight coupling of realization theory, controlled empirical comparison, and open-source tooling, paving the way for LLM architectures capable of long-context and agentic reasoning under resource constraints.
