TOSE: Tiled Operator-Space Evolution in SSMs
- TOSE is a core algorithmic primitive that achieves exact analytical differentiation in linear state-space models with strict O(1) memory usage via tiling.
- It reframes gradient computation as forward evolution of an augmented dynamical system, detaching intermediate histories to eliminate memory scaling with sequence length.
- Empirical results show TOSE reduces VRAM usage by up to 94% and substantially improves throughput over reverse-mode Autograd, making it well suited to large-scale applications such as genomics.
Tiled Operator-Space Evolution (TOSE) is a core algorithmic primitive for O(1)-memory exact analytical differentiation through linear state-space models (SSMs), enabling efficient sensitivity analysis for large-scale sequence modeling tasks. TOSE reframes the gradient computation as the forward evolution of an augmented dynamical system, exploiting the associative structure of operator evolution and leveraging a tiling and scan-based protocol to eliminate the memory overhead inherent in conventional backpropagation. TOSE supports applications such as chromosome-scale modeling in genomics, where conventional Autograd approaches encounter prohibitive memory costs.
1. Theoretical Basis and Role in Phase Gradient Flow
TOSE underpins the Phase Gradient Flow (PGF) framework, which addresses the memory bottleneck in gradient-based sensitivity analysis for large SSMs. PGF interprets backpropagation through a linear recurrence as the forward evolution of a “tangent” dynamical system. For a discrete-time SSM defined by
$$h_{t+1} = A_t h_t + B_t u_t, \qquad y_t = C_t h_t,$$
the Fréchet derivative product evolves under a linear recurrence isomorphic to the original dynamics.
TOSE leverages this dynamical isomorphism by tiling the sequence into fixed-size blocks. Within each tile, both the primal and tangent systems are evolved via a parallel scan over associative operator products. The computational graph is detached at tile boundaries, resulting in memory usage that does not scale with sequence length $L$. This enables strictly O(1) differentiation-graph memory in $L$ (Wang et al., 28 Dec 2025).
2. Mathematical Foundation
2.1 Primal and Tangent Recursions
Given the primal recurrence $h_{t+1} = A_t(\theta)\,h_t + B_t(\theta)\,u_t$, and allowing for diagonal or block-diagonal, possibly time-varying system matrices (as in Mamba/S4), the tangent dynamics for sensitivity analysis (the variation $v_t = \partial h_t / \partial \theta$) satisfy
$$v_{t+1} = A_t\,v_t + \frac{\partial A_t}{\partial \theta}\,h_t + \frac{\partial B_t}{\partial \theta}\,u_t,$$
where, in the diagonal SSM setting, all derivatives and per-step operators remain element-wise linear.
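As a concrete check, the tangent recurrence for a single diagonal coordinate can be validated against finite differences. The sketch below is a minimal illustration; the parametrization $a(\theta) = e^{-\theta}$, $b(\theta) = \theta$ is an assumption for demonstration, not from the paper.

```python
import numpy as np

# One diagonal-SSM coordinate (illustrative parametrization):
#   h_{t+1} = a(theta) h_t + b(theta) u_t
#   v_{t+1} = a(theta) v_t + a'(theta) h_t + b'(theta) u_t,   v_t = dh_t/dtheta

def primal_and_tangent(theta, u, h0=0.0):
    a, da = np.exp(-theta), -np.exp(-theta)   # a(theta) and its derivative
    b, db = theta, 1.0                        # b(theta) and its derivative
    h, v = h0, 0.0
    for ut in u:
        # Tuple assignment: the tangent update uses the *previous* h, as required.
        h, v = a * h + b * ut, a * v + da * h + db * ut
    return h, v
```

Comparing the returned `v` against a central finite difference of the primal state with respect to `theta` confirms the tangent recurrence computes the exact sensitivity.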
2.2 Augmented Operator-Space
The triplet $(h_t, v_t, 1)$ is encoded as a single augmented vector $z_t = [h_t,\, v_t,\, 1]^\top$, and an augmented block-matrix recurrence $z_{t+1} = M_t z_t$ is defined, where each $M_t$ is
$$M_t = \begin{pmatrix} A_t & 0 & B_t u_t \\ \partial_\theta A_t & A_t & \partial_\theta B_t\, u_t \\ 0 & 0 & 1 \end{pmatrix},$$
with $z_0 = [h_0,\, 0,\, 1]^\top$ and $v_0 = 0$.
The associativity of matrix products allows parallel prefix-scan computation of these operator products. The diagonal structure reduces the computation, for each coordinate, to products of compact ($3 \times 3$) blocks.
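The associativity property can be illustrated for one scalar coordinate: a pairwise tree reduction of the per-step $3 \times 3$ operators yields the same total product as a left-to-right fold. Helper names below are assumptions for illustration.

```python
import numpy as np
from functools import reduce

def step_operator(a, da, b, db, u):
    """3x3 augmented operator M_t acting on z = [h, v, 1] for one diagonal coordinate."""
    return np.array([[a,   0.0, b * u],
                     [da,  a,   db * u],
                     [0.0, 0.0, 1.0]])

def tree_product(ops):
    """Total operator product M_{T-1} ... M_0 via pairwise tree reduction.
    Each pairing is the associative combine a parallel scan would perform."""
    ops = list(ops)
    while len(ops) > 1:
        nxt = [ops[i + 1] @ ops[i] for i in range(0, len(ops) - 1, 2)]
        if len(ops) % 2:              # odd count: last operator carries over
            nxt.append(ops[-1])
        ops = nxt
    return ops[0]
```

Because matrix multiplication is associative, the tree-shaped combine order produces exactly the same result as the sequential order, which is what licenses the parallel scan.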
3. Tiling and Parallel Scan Protocol
The sequence of length $L$ is partitioned into tiles of size $B$. For each tile spanning steps $t = s, \dots, s + B - 1$, the per-step operators $M_t$ are constructed and a parallel scan (prefix product) is performed over the block. Only the augmented state at the tile boundary is retained for subsequent computation, and intermediate autograd history within the tile is detached, yielding memory complexity independent of $L$.
The TOSE protocol is formalized as follows:
| Step | Operation | Memory Effect |
|---|---|---|
| a) Load block inputs | Read the tile's inputs $u_t$ | No graph growth |
| b) Build per-step operators | Construct $M_t$ for each step in the tile | Per-step, local |
| c) Parallel scan over tile | Compute prefix products of the $M_t$ | Intra-tile only |
| d) Output sensitivity | Store external outputs (e.g., $v_t$) | Outside the recursive graph |
| e) Boundary state detach | Boundary state updated via `.detach()` | Limits recursive history |
TOSE relies on any associative parallel prefix-product implementation (e.g., a tree-based scan) to combine per-step operators within each tile.
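The five steps above can be sketched in NumPy for one diagonal coordinate. In a real autograd framework the boundary state would be `.detach()`ed; here that is emulated by carrying only the 3-vector boundary state across tiles, so nothing else survives a tile. Function and parameter names are illustrative.

```python
import numpy as np

def tose_evolve(a, da, b, db, u, tile=16):
    """Tile-by-tile evolution of z = [h, v, 1]; only the boundary state crosses tiles."""
    z = np.array([0.0, 0.0, 1.0])                     # h_0 = 0, v_0 = 0
    T = len(u)
    for s in range(0, T, tile):
        # (b) build per-step 3x3 operators for this tile
        ops = [np.array([[a[t],  0.0,  b[t] * u[t]],
                         [da[t], a[t], db[t] * u[t]],
                         [0.0,   0.0,  1.0]]) for t in range(s, min(s + tile, T))]
        # (c) combine within the tile; a sequential fold stands in for the parallel scan
        M = np.eye(3)
        for op in ops:
            M = op @ M
        # (e) only the boundary state survives -- the .detach() point in autograd terms
        z = M @ z
    return z[0], z[1]                                 # (h_T, sensitivity v_T)
```

The tiled evolution reproduces the fully sequential primal/tangent recurrence exactly, while only ever holding one tile's worth of operators at a time.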
4. Memory Complexity and Differentiation Graph Architecture
By detaching computational history at tile boundaries, the memory footprint for automatic differentiation is proportional to the tile size $B$, a strict constant in $L$ as $L \to \infty$. All input and output tensors (e.g., $u_{1:L}$, $y_{1:L}$) retain linear scaling in $L$, but these are not part of the recursive differentiation graph. TOSE thus guarantees strictly O(1) memory scaling for the differentiation graph with respect to sequence length.
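The scaling contrast can be stated as a one-line accounting sketch (tile size $B$ and state width $d$ are illustrative placeholders, not values from the paper):

```python
# Live differentiation-graph activations per layer (illustrative accounting only):
def autograd_live_activations(L, d):
    return L * d          # reverse mode keeps the whole history in the graph

def tose_live_activations(L, d, B=64):
    return B * d          # only the current tile's graph is ever alive
```

Doubling $L$ doubles the Autograd graph but leaves the TOSE graph unchanged.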
5. Numerical Stability in Stiff ODE Regimes
Conventional parallel scan algorithms exhibit catastrophic underflow or overflow for very stiff ODEs (e.g., eigenvalues with large negative real parts) in finite precision arithmetic. TOSE incorporates a "log-shifting stabilizer," normalizing intermediate products within each tile by subtracting the local maximum log-magnitude prior to exponentiation and restoring the shift post-scan. Analytical error analysis demonstrates that the relative error after $L$ steps is bounded by a constant multiple of machine precision, with the constant governed by the spectral radius of the per-step operator (of order unity in typical scenarios). Empirical regression of error versus $L$ shows negligible slope, confirming no error accumulation with increasing sequence length. This distinguishes TOSE from naive parallel prefix schemes, which fail under extreme stiffness for large $L$ (Wang et al., 28 Dec 2025).
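A minimal sketch of the per-tile log-shifting idea for one diagonal coordinate follows. The stabilizer in the paper operates on the scan's intermediate operator products; reducing it to cumulative log-sums, as here, is a simplification for illustration.

```python
import numpy as np

def stabilized_prefix_products(log_a):
    """Prefix products prod_{s<=t} a_s, returned as (scaled, shift) so that the
    true value is exp(shift) * scaled; shift is the local max log-magnitude."""
    cs = np.cumsum(log_a)
    shift = cs.max()                       # subtract the local maximum before exp
    return np.exp(cs - shift), shift       # restore post-scan: exp(shift) * scaled

# Stiff regime: each step multiplies by exp(-100); after 8 steps the naive
# product is exp(-800), below the smallest float64 subnormal (~5e-324).
log_a = np.full(8, -100.0)
naive = np.cumprod(np.exp(log_a))          # underflows to exactly 0.0
scaled, shift = stabilized_prefix_products(log_a)
```

The naive product silently becomes zero, destroying all sensitivity information, while the shifted representation retains the full log-magnitude $-800$ split across `shift` and `scaled`.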
6. Empirical Performance and Scalability
6.1 Memory Utilization
Standard Autograd backpropagation incurs $O(L)$ activation storage; at long sequence lengths, a single layer exceeds $10$ GB VRAM. TOSE maintains graph memory at a constant level irrespective of $L$. Across the evaluated sequence lengths, PGF achieves up to a $94\%$ reduction in peak VRAM usage on NVIDIA RTX 5090 hardware.
6.2 Throughput
On an RTX 5060 Laptop GPU, PGF demonstrates a substantial throughput improvement over reverse-mode Autograd at moderate sequence lengths, and sustains higher throughput at chromosome-scale $L$. This efficiency was evaluated via samples per second (see Figures 1 and 3 in (Wang et al., 28 Dec 2025)).
6.3 Sensitivity Analysis Benchmarks
In a $128,000$-step impulse-response (“Ghost Pulse”) benchmark, TOSE precisely recovers micro-perturbations injected at late timesteps without numerical leakage, a regime where Autograd encounters out-of-memory failures in multi-layered models.
7. Applications and Significance
TOSE enables feasible genome-scale sensitivity analysis on single 8–12 GB GPUs. This bridges the practical gap between infinite-context theoretical models and existing hardware constraints. The protocol is applicable to any linear SSM and directly supports models with diagonal or block-diagonal system matrices, including Mamba and S4 variants.
A plausible implication is that TOSE could generalize to other operator evolution domains beyond SSMs, provided they admit associative and scan-friendly forms. The minimal error accumulation and strict memory guarantees position TOSE as a foundational tool for large-scale differentiable modeling, particularly in genomics, signal processing, and other sequence analysis contexts where computational memory was previously prohibitive (Wang et al., 28 Dec 2025).