Prefix-Scannable Models (PSMs)
- Prefix-Scannable Models (PSMs) are a unifying class of neural sequence models that leverage prefix scan mechanics to achieve parallel training and efficient sequential inference.
- They operate by encoding the input in chunks, aggregating prefix states along a fixed tree via the Blelloch scan, and decoding outputs with an inference module that handles streaming data with low memory overhead.
- PSMs generalize architectures like RNNs, state-space models, and transformers, providing a flexible framework for scalable, long-context modeling and real-time applications.
Prefix-Scannable Models (PSMs) are a broad, unifying class of neural sequence models whose internal computations align with the structure of the parallel prefix scan (in particular, the Blelloch scan), a fundamental parallel algorithm. This property enables such models to achieve efficient, parallelizable training and low-latency, streaming inference. PSMs generalize and connect a range of architectures, including classic RNNs, state-space models, linear transformers, and chunked-attention transformers, by abstracting their state update and inference through the lens of prefix scan mechanics. They offer a principled framework for building neural sequence models that reconcile the demands of high expressivity, scalable parallelism, and sequential inference efficiency.
1. Foundational Definition and Formalization
Prefix-Scannable Models are formally defined over sequences divided into contiguous chunks, with three key learnable components:
- Encoder: $\mathsf{Enc}$, encoding each chunk $\mathcal{C}_i$ into a representation $\vx_i$.
- Aggregator: $\mathsf{Agg}_\theta$, updating the prefix state from the previous state and the current chunk representation. Notably, $\mathsf{Agg}_\theta$ need not be associative, enabling support for both RNN-like and attention-based models.
- Inference Module: $\mathsf{Inf}_\phi$, mapping the current prefix state and input chunk to the output.
The computation on an input sequence $\va_{0:n-1}$ of length $n$ and chunk size $c$, split into $r$ contiguous chunks $\mathcal{C}_0, \dots, \mathcal{C}_{r-1}$, proceeds as follows:
- Encoding: $\vx_i = \mathsf{Enc}(\mathcal{C}_i)$
- Prefix State Assignment: $\{\vs_i\}_{i=0}^{r-1} = \text{BlellochScan}(\{\vx_i\}, \mathsf{Agg}_\theta, \ve)$, where $\ve$ is the identity element serving as the initial state.
- Prediction: $\hat{\vy}_{ic:(i+1)c-1} = \mathsf{Inf}_\phi(\vs_{i-1}, \mathcal{C}_i)$
This architecture allows both parallel training (via the prefix scan) and sequential inference (via an efficient online scan with logarithmic state memory).
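To make this protocol concrete, the following is a minimal reference sketch in Python. The names `psm_forward`, `enc`, `agg`, `inf`, and `identity` are placeholders standing in for $\mathsf{Enc}$, $\mathsf{Agg}_\theta$, $\mathsf{Inf}_\phi$, and $\ve$, not an implementation from the literature; for readability the prefix states are computed with a plain left-to-right fold, whereas an actual PSM evaluates a fixed Blelloch scan tree, which matters when the aggregator is non-associative.

```python
from typing import Callable, List, Sequence, TypeVar

X = TypeVar("X")  # type of chunk representations / prefix states
Y = TypeVar("Y")  # type of per-chunk outputs

def psm_forward(
    tokens: Sequence,                  # input sequence a_{0:n-1}
    chunk_size: int,                   # chunk size c
    enc: Callable[[Sequence], X],      # Enc: chunk -> representation x_i
    agg: Callable[[X, X], X],          # Agg_theta: (s_{i-1}, x_i) -> s_i
    inf: Callable[[X, Sequence], Y],   # Inf_phi: (s_{i-1}, chunk_i) -> predictions
    identity: X,                       # identity element e, playing the role of s_{-1}
) -> List[Y]:
    # Split the input into contiguous chunks C_0, ..., C_{r-1}.
    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]

    # Chunk encoding (fully parallel in practice).
    reps = [enc(chunk) for chunk in chunks]

    # Prefix states; a left-to-right fold here, a Blelloch scan in training.
    states, s = [], identity
    for x in reps:
        s = agg(s, x)
        states.append(s)

    # Prediction: chunk i is decoded from the *previous* prefix state s_{i-1}.
    prev_states = [identity] + states[:-1]
    return [inf(s_prev, chunk) for s_prev, chunk in zip(prev_states, chunks)]
```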
2. Sequential-Parallel Duality in PSMs
A defining property of PSMs is sequential-parallel duality: the capability for highly parallel computation during training and efficient, low-memory streaming operation at inference. Specifically, PSMs satisfy the requirements of:
- Parallel Training: The use of the parallel prefix scan algorithm (Blelloch scan) allows per-token or per-chunk computations in polylogarithmic ($O(\operatorname{polylog} n)$) circuit depth with linear ($O(n)$) work in the sequence length $n$.
- Efficient Inference: Using a binary-counter online scan, PSMs can, at inference time, compute per-token outputs in amortized $O(1)$ time per token (amortized $O(1)$ aggregator calls per chunk) and store only $O(\log(n/c))$ prefix state entries, dramatically reducing the memory overhead compared to the linearly growing key-value cache of classic transformers.
Crucially, the equivalence between static (parallel) prefix computation and online sequential scan holds even for non-associative aggregator functions, as long as a fixed parenthesization is maintained. This unifies the computational strategy for both training and streaming serving within a single model family.
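The binary-counter idea behind the online scan can be sketched as follows: the scanner keeps one partial aggregate per set bit of the chunk counter, so only logarithmically many states are stored and merges are amortized constant per chunk. This is an illustrative sketch of ours, not the paper's code; in particular, the left-fold readout in `prefix()` matches the Blelloch parenthesization only when the aggregator is associative, whereas the actual online scan reproduces the fixed Blelloch bracketing exactly.

```python
from typing import Callable, List, Tuple, TypeVar

S = TypeVar("S")  # prefix-state type

class OnlineScan:
    """Streaming scan storing one partial aggregate per set bit of the chunk
    counter, i.e. O(log n) states after n chunks."""

    def __init__(self, agg: Callable[[S, S], S], identity: S):
        self.agg = agg
        self.identity = identity
        # (level, state) pairs; a level-k entry summarizes a block of 2**k chunks.
        self.buckets: List[Tuple[int, S]] = []

    def push(self, x: S) -> None:
        """Ingest the next chunk representation (amortized O(1) merges)."""
        level, state = 0, x
        # Binary-counter carry: merge while a block of the same size exists.
        while self.buckets and self.buckets[-1][0] == level:
            _, older = self.buckets.pop()
            state = self.agg(older, state)  # older block goes on the left
            level += 1
        self.buckets.append((level, state))

    def prefix(self) -> S:
        """Current prefix state, folding the O(log n) blocks oldest-first."""
        s = self.identity
        for _, block in self.buckets:
            s = self.agg(s, block)
        return s
```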
3. Relationship to State Space Models and Transformer Families
PSMs extend traditional state space models (SSMs) and linear transformers by relaxing the requirement for an associative aggregator. In SSMs, the state is updated via an affine operation:
$\vs_t = \mE_t \blacktriangleright \vs_{t-1} + \vf_t, \qquad \vs_{-1} = 0$
with aggregation: $(\mE_2, \vf_2) \oplus (\mE_1, \vf_1) = (\mE_2 \circ \mE_1, \vf_2 + \mE_2 \blacktriangleright \vf_1)$. This associative property enables scan-based parallelization as in Mamba, RWKV, and related models.
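As a quick sanity check (our own, not taken from any reference implementation), the snippet below verifies numerically that this affine aggregator is associative, which is exactly the property that licenses Blelloch-style parallelization in Mamba- and RWKV-like models.

```python
import numpy as np

def oplus(later, earlier):
    """(E2, f2) ⊕ (E1, f1) = (E2 @ E1, f2 + E2 @ f1), matching the rule above."""
    E2, f2 = later
    E1, f1 = earlier
    return E2 @ E1, f2 + E2 @ f1

rng = np.random.default_rng(0)
d = 4
a, b, c = [(rng.standard_normal((d, d)), rng.standard_normal(d)) for _ in range(3)]

left = oplus(oplus(a, b), c)   # ((a ⊕ b) ⊕ c)
right = oplus(a, oplus(b, c))  # (a ⊕ (b ⊕ c))
assert all(np.allclose(x, y) for x, y in zip(left, right))  # associativity holds
```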
PSMs generalize this setup: the aggregator can be arbitrary, e.g., a non-associative function like softmax attention used in transformer blocks. This enables inclusion of models that use chunked or localized attention (e.g., a chunk-wise transformer), subsuming both affine SSMs and advanced transformers in a unified framework.
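By contrast, an attention-style aggregator has no such structure. The sketch below is a hypothetical chunk-wise attention aggregator, written only to show why softmax mixing cannot be reassociated; the exact form of the aggregator used in Transformer-PSMs may differ.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attn_agg(left, right):
    """Combine two (c, d) blocks: the right block attends over [left; right],
    and the attended output becomes the new fixed-size (c, d) summary."""
    context = np.concatenate([left, right], axis=0)        # (2c, d) keys/values
    scores = right @ context.T / np.sqrt(right.shape[-1])  # (c, 2c) logits
    return softmax(scores, axis=-1) @ context              # (c, d) new state

# Softmax mixing does not reassociate: the two bracketings disagree.
rng = np.random.default_rng(0)
x1, x2, x3 = rng.standard_normal((3, 4, 8))                # three (c=4, d=8) blocks
assert not np.allclose(attn_agg(attn_agg(x1, x2), x3),
                       attn_agg(x1, attn_agg(x2, x3)))
```

Because no reassociation is possible, training and inference must commit to the same fixed computation tree, which is precisely the role of the Blelloch parenthesization discussed above.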
4. Mathematical Structure and Computation
PSMs operate via the following mathematical protocol:
- Chunk Encoding: $\forall i, \vx_i = \mathsf{Enc}(\mathcal{C}_i)$
- State Aggregation via Prefix Scan:
$\vs_i = \mathsf{Agg}_\theta(\vs_{i-1}, \vx_i)$
or more generally, for non-associative $\mathsf{Agg}_\theta$, a fixed scan tree (the Blelloch scan) ensures deterministic computation; a sketch of such a fixed-tree scan is given after the complexity table below.
- Inference per Chunk:
$\hat{\vy}_{ic:(i+1)c-1} = \mathsf{Inf}_\phi(\vs_{i-1}, \mathcal{C}_i)$
The time and memory complexities, for sequence length $n$ and chunk size $c$, are as follows:
Operation Type | Time/Memory (PSM, chunk size $c$) |
---|---|
Training (parallel) | $O(\operatorname{polylog} n)$ depth, $O(n)$ work |
Inference (seq.) | $O(n)$ total over the stream, amortized $O(1)$ aggregator calls per chunk |
Per-token latency | amortized $O(1)$, independent of sequence position |
Memory | $O(\log(n/c))$ prefix state entries |
Associative cases (standard SSMs) allow constant-state memory for unbounded sequences; more expressive chunked-attention PSMs require only logarithmic additional memory.
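The fixed-parenthesization point can be made concrete with a small divide-and-conquer scan: the bracketing of the aggregator is determined entirely by the recursion tree, so even a non-associative aggregator yields deterministic prefix states. This simple tree performs $O(n \log n)$ aggregator calls, whereas the Blelloch up-sweep/down-sweep used by PSMs achieves $O(n)$ work; the sketch is illustrative, not the paper's construction.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")

def tree_scan(xs: List[T], agg: Callable[[T, T], T]) -> List[T]:
    """Inclusive prefix scan over a non-empty list with a fixed recursion tree:
    the bracketing of agg is set by the tree, not by left-to-right order."""
    if len(xs) == 1:
        return [xs[0]]
    mid = len(xs) // 2
    left = tree_scan(xs[:mid], agg)    # prefixes of the left half
    right = tree_scan(xs[mid:], agg)   # *local* prefixes of the right half
    left_total = left[-1]              # aggregate of the entire left half
    # Each right-half prefix is fixed as agg(left_total, local_prefix).
    return left + [agg(left_total, r) for r in right]
```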
5. Empirical Evidence and Performance
Comprehensive empirical evaluation on both synthetic and real-world tasks demonstrates the practical efficacy and flexibility of PSMs:
State Tracking Task
Transformer-PSMs using non-associative (softmax-based, chunked) aggregation outperform both standard GPT-2 and Mamba models in length generalization, maintaining low error rates even when sequence lengths are far outside the training distribution.
Multi-Query Associative Recall (MQAR)
Transformer-PSMs achieve nearly perfect recall for demanding memory tasks, matching or exceeding full transformer performance at large chunk sizes and outperforming sliding window transformers and SSMs (here Mamba) in situations that require long-term memory and recall.
Language Modeling (WikiText-103)
With moderate chunk sizes, Transformer-PSMs deliver language modeling performance nearly matching fully self-attention-based transformers, while providing constant per-token inference time, in contrast to the per-token cost of standard transformers, which grows with context length.
The choice of chunk size allows explicit tuning of the trade-off between context span (transformer-like) and efficiency (RNN/SSM-like), giving practitioners a flexible design knob.
6. Applications, Implications, and Design Unification
PSMs provide a blueprint for constructing new sequence architectures that are:
- Efficient and parallelizable in training (for large batch/sequence workloads).
- Highly efficient and low-memory in inference (for streaming, low-latency, or resource-constrained environments).
- Expressive: By incorporating non-associative and attention-based aggregators, PSMs can match or outperform standard transformers and SSMs on a range of language and algorithmic tasks.
- Unifying: PSMs generalize and subsume Mamba, RWKV, Gated Linear Attention, mLSTM, linear transformers, and chunked-attention transformers within one formal framework.
A practical implication is that PSMs support rapid model experimentation: researchers can choose, design, and combine encoding, aggregation, and inference modules to create new architectures that seamlessly interpolate between transformer and state space properties. PSMs are well-suited for long-context modeling, real-time language understanding, sequential event processing, and efficient reasoning tasks.
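As a toy illustration of this design knob, and reusing the hypothetical `psm_forward` sketch from Section 1, swapping only the aggregation module moves the same skeleton between a linear, SSM-flavored recurrence and a nonlinear update; the encoder, decoder, and both aggregators here are trivial placeholders, not proposed architectures.

```python
import numpy as np

rng = np.random.default_rng(1)
d, c = 8, 4
tokens = rng.standard_normal((32, d))               # 32 toy "tokens" of width d

enc = lambda chunk: np.asarray(chunk).mean(axis=0)  # Enc: (c, d) chunk -> (d,) summary
inf = lambda s, chunk: np.asarray(chunk) + s        # Inf: add prefix context back to the chunk
zero = np.zeros(d)                                  # identity / initial state

decay_agg = lambda s, x: 0.9 * s + x                # linear, SSM-flavored recurrence
gated_agg = lambda s, x: np.tanh(x) * s + x         # nonlinear update with no affine reformulation

outputs_ssm = psm_forward(tokens, c, enc, decay_agg, inf, identity=zero)
outputs_gated = psm_forward(tokens, c, enc, gated_agg, inf, identity=zero)
```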
7. Mathematical Table: Model Classes and Duality
Model Class | Parallel Training Depth & Cost | Sequential Inference Time | Memory | Aggregator Requirement |
---|---|---|---|---|
Transformer | $O(\log n)$ depth, $O(n^2)$ work | $O(n)$ per token | $O(n)$ (KV cache) | Softmax (non-assoc.) |
Elementwise RNN/SSM | $O(\log n)$ depth, $O(n)$ work | $O(1)$ per token | $O(1)$ | Affine (assoc.) |
Generic PSM | $O(\operatorname{polylog} n)$ depth, $O(n)$ work | $O(1)$ per token (amort.) | $O(\log(n/c))$ | Any, incl. non-assoc. |
Conclusion
Prefix-Scannable Models (PSMs) constitute a mathematically rigorous, resource-efficient, and expressive class of neural sequence architectures. They reconcile the demands of high-throughput, parallel training and fast, streaming inference, supporting both associative (state-based) and non-associative (attention-based) update schemes. PSMs underpin a broad family of contemporary architectures—uniting transformers, state-space models, and hybrid designs—and provide a theoretical and practical framework for future advancements in efficient and scalable sequence modeling.