TOSE: Tiled Operator-Space Evolution in SSMs

Updated 4 January 2026
  • TOSE is a core algorithmic primitive that achieves exact analytical differentiation in linear state-space models with strict O(1) memory usage via tiling.
  • It reframes gradient computation as forward evolution of an augmented dynamical system, detaching intermediate histories to eliminate memory scaling with sequence length.
  • Empirical results show TOSE reduces peak VRAM usage by up to 94% and improves throughput by up to 23×, making it well suited to large-scale applications such as genomics.

Tiled Operator-Space Evolution (TOSE) is a core algorithmic primitive for O(1)-memory exact analytical differentiation through linear state-space models (SSMs), enabling efficient sensitivity analysis for large-scale sequence modeling tasks. TOSE reframes the gradient computation as the forward evolution of an augmented dynamical system, exploiting the associative structure of operator evolution and leveraging a tiling and scan-based protocol to eliminate the memory overhead inherent in conventional backpropagation. TOSE supports applications such as chromosome-scale modeling in genomics, where conventional Autograd approaches encounter prohibitive memory costs.

1. Theoretical Basis and Role in Phase Gradient Flow

TOSE underpins the Phase Gradient Flow (PGF) framework, which addresses the memory bottleneck in gradient-based sensitivity analysis for large SSMs. PGF interprets backpropagation through a linear recurrence as the forward evolution of a "tangent" dynamical system. For a discrete-time SSM defined by

$$x_{t+1} = A x_t + B u_t, \qquad y_t = C x_t + D u_t,$$

the Fréchet derivative product $\nabla y_t = D\mathcal{F}[u] \cdot \nabla u_t$ evolves under a linear recurrence isomorphic to the original dynamics.

TOSE leverages this dynamical isomorphism by tiling the sequence into fixed-size blocks. Within each tile, both the primal and tangent systems are evolved via a parallel scan over associative operator products. The computational graph is detached at tile boundaries, resulting in memory usage that does not scale with sequence length $L$. This enables strictly O(1) differentiation-graph memory in $L$ (Wang et al., 28 Dec 2025).

2. Mathematical Foundation

2.1 Primal and Tangent Recursions

Given $x_t \in \mathbb{R}^n$, $u_t, y_t \in \mathbb{R}^d$, and allowing for diagonal or block-diagonal, possibly time-varying system matrices (as in Mamba/S4), the tangent dynamics for sensitivity analysis (the variation $\nabla x_t := \partial x_t / \partial u \cdot \nabla u$) satisfy

$$\nabla x_{t+1} = A \nabla x_t + (\partial A / \partial u_t \cdot \nabla u_t)\, x_t + (\partial B / \partial u_t \cdot \nabla u_t)\, u_t + B \nabla u_t,$$

$$\nabla y_t = C \nabla x_t + D \nabla u_t,$$

where, in the diagonal SSM setting, all derivatives and per-step operators remain element-wise linear.
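As a concrete check of the tangent recursion, the following minimal pure-Python sketch evolves the primal and tangent systems for a scalar (single-coordinate) SSM with constant coefficients, so that $\partial A / \partial u_t = \partial B / \partial u_t = 0$, and compares the analytic sensitivity against a finite-difference probe. All numeric values are illustrative toys, not taken from the paper:

```python
import math

# Scalar SSM with constant coefficients, so dA/du = dB/du = 0 and the
# tangent recurrence reduces to grad_x[t+1] = A*grad_x[t] + B*grad_u[t].
# All numeric values are illustrative toys, not from the paper.
A, B, C, D = 0.9, 0.5, 1.2, 0.1

def run(u):
    """Primal recursion: y_t = C x_t + D u_t, then x_{t+1} = A x_t + B u_t."""
    x, ys = 0.0, []
    for ut in u:
        ys.append(C * x + D * ut)
        x = A * x + B * ut
    return ys

def run_tangent(du):
    """Tangent recursion, structurally identical to the primal one."""
    gx, gys = 0.0, []
    for dut in du:
        gys.append(C * gx + D * dut)
        gx = A * gx + B * dut
    return gys

u = [math.sin(0.3 * t) for t in range(50)]
du = [1.0 if t == 10 else 0.0 for t in range(50)]  # perturb u_10 only

analytic = run_tangent(du)

# Finite-difference probe of the same directional derivative.
eps = 1e-6
u_plus = [ut + eps * dut for ut, dut in zip(u, du)]
fd = [(yp - y) / eps for yp, y in zip(run(u_plus), run(u))]

err = max(abs(a - f) for a, f in zip(analytic, fd))
print(err)  # tiny: the analytic tangent matches finite differences
```

Because the toy system is linear, the finite-difference probe recovers the directional derivative up to floating-point rounding, which is exactly what the tangent recursion computes without any stored backward history.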

2.2 Augmented Operator-Space

The triplet $(x_t, \nabla x_t, 1)$ is encoded as a single augmented vector, and an augmented block-matrix recurrence is defined:

$$\begin{bmatrix} x_t \\ \nabla x_t \\ 1 \end{bmatrix} = M_t \begin{bmatrix} x_{t-1} \\ \nabla x_{t-1} \\ 1 \end{bmatrix},$$

where each $M_t$ is

$$M_t = \begin{bmatrix} A & 0 & B u_t \\ K_t & A & j_t \\ 0 & 0 & 1 \end{bmatrix},$$

with $K_t = \partial A / \partial u_t \cdot \nabla u_t$ and $j_t = (\partial B / \partial u_t \cdot \nabla u_t)\, u_t + B \nabla u_t$.

The associativity $M_{t+1} \cdot M_t = M_{t:t+1}$ allows these operator products to be computed with a parallel prefix scan. The diagonal structure reduces the computation, for each coordinate, to products of compact $3 \times 3$ blocks.
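The associative composition of the augmented operators can be verified per coordinate with plain $3 \times 3$ matrices. This toy sketch uses constant $A, B$ (hence $K_t = 0$ and $j_t = B\,\nabla u_t$) and checks that applying the composed product $M_2 M_1$ equals stepping the augmented vector twice; the numbers are illustrative:

```python
# Per-coordinate check of operator associativity for the augmented
# 3x3 blocks. Constant A, B (hence K_t = 0, j_t = B * du_t); all
# numeric values are illustrative toys.
A, B = 0.8, 0.4

def mat(u_t, du_t):
    return [[A, 0.0, B * u_t],
            [0.0, A, B * du_t],
            [0.0, 0.0, 1.0]]

def matmul(P, Q):
    return [[sum(P[i][k] * Q[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def matvec(P, v):
    return [sum(P[i][k] * v[k] for k in range(3)) for i in range(3)]

M1, M2 = mat(1.0, 0.2), mat(-0.5, 0.7)
v0 = [0.3, 0.0, 1.0]  # augmented vector (x_0, grad_x_0, 1)

stepwise = matvec(M2, matvec(M1, v0))  # evolve one step at a time
fused = matvec(matmul(M2, M1), v0)     # apply the composed operator once

print(all(abs(a - b) < 1e-12 for a, b in zip(stepwise, fused)))
```

It is this equality, holding for arbitrary groupings of adjacent operators, that licenses computing the products in any tree order inside a prefix scan.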

3. Tiling and Parallel Scan Protocol

Sequence length $L$ is partitioned into tiles of size $B$. For each tile $k$ spanning steps $[(k-1)B + 1,\; kB]$, the per-step operator $M_t$ is constructed and a parallel scan (prefix product) is performed over the block. Only the state at the tile boundary is retained for subsequent computation, and intermediate autograd history within the tile is detached, yielding memory complexity independent of $L$.

The TOSE protocol is formalized as follows:

  • Step a) Load block inputs: $u_{\text{block}},\; \nabla u_{\text{block}}$ (memory effect: no graph growth)
  • Step b) Build per-step operators: construct $M_t$ for the tile (memory effect: per-step, local)
  • Step c) Parallel scan over $M_t$: compute $[x_{\text{out}}, \nabla x_{\text{out}}]$ (memory effect: intra-tile only)
  • Step d) Output sensitivity: $\nabla y_{\text{block}} = C \nabla x_{\text{out}} + D \nabla u_{\text{block}}$ (memory effect: store external output)
  • Step e) Boundary state detach: $x_{\text{in}}, \nabla x_{\text{in}}$ updated via .detach() (memory effect: limits recursive history)

TOSE relies on any associative parallel prefix-product implementation (e.g., a tree-based scan) to combine per-step operators within each tile.
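The per-tile protocol can be sketched in pure Python for one coordinate of a diagonal SSM, with a tree-structured prefix product standing in for the GPU parallel scan and a plain state overwrite standing in for `.detach()`. Constant $A, B$ (so $K_t = 0$) and all values are illustrative:

```python
# Toy per-tile evolution for one coordinate of a diagonal SSM with
# constant A, B (so K_t = 0 and j_t = B * du_t). The tree reduction
# below stands in for the parallel scan, and overwriting `state`
# stands in for the .detach() at each tile boundary.
A, B = 0.9, 0.3

def mat(u_t, du_t):
    # Augmented per-step operator M_t acting on (x, grad_x, 1).
    return [[A, 0.0, B * u_t],
            [0.0, A, B * du_t],
            [0.0, 0.0, 1.0]]

def matmul(P, Q):
    return [[sum(P[i][k] * Q[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def matvec(P, v):
    return [sum(P[i][k] * v[k] for k in range(3)) for i in range(3)]

def tile_product(ops):
    """Tree-structured prefix product: combines adjacent operator
    pairs level by level (O(log B) depth, as in a parallel scan)."""
    while len(ops) > 1:
        nxt = [matmul(ops[i + 1], ops[i]) for i in range(0, len(ops) - 1, 2)]
        if len(ops) % 2:
            nxt.append(ops[-1])
        ops = nxt
    return ops[0]

L, Btile = 16, 4
u = [0.1 * t for t in range(L)]
du = [1.0] * L
state = [0.0, 0.0, 1.0]  # boundary state (x_in, grad_x_in, 1)

for k in range(0, L, Btile):
    ops = [mat(u[t], du[t]) for t in range(k, k + Btile)]
    state = matvec(tile_product(ops), state)  # keep boundary state only

# Sequential reference over the full sequence agrees with the tiled scan.
ref = [0.0, 0.0, 1.0]
for t in range(L):
    ref = matvec(mat(u[t], du[t]), ref)

print(max(abs(a - b) for a, b in zip(state, ref)))
```

Note that after each tile only the three boundary numbers survive; the per-step operators inside the tile are discarded, which is the source of the constant graph memory.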

4. Memory Complexity and Differentiation Graph Architecture

By detaching computational history at tile boundaries, the memory footprint for automatic differentiation is proportional to $B \cdot (\dim x + \dim \nabla x)$, a strict constant in $L$ as $L \rightarrow \infty$. All input and output tensors (e.g., $u,\; \nabla u,\; y,\; \nabla y$) retain linear scaling in $L$, but these are not part of the recursive differentiation graph. TOSE thus guarantees strictly O(1) memory scaling for the differentiation graph with respect to sequence length.
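A toy accounting of this claim: under tiling, only the $B$ per-step operators of the current tile are ever simultaneously live in the graph, regardless of $L$. The helper below is a hypothetical illustration of that bookkeeping, not part of the TOSE implementation:

```python
# Hypothetical accounting helper (illustrative, not from the paper):
# counts how many per-step operators are ever simultaneously live in
# the differentiation graph when history is detached at tile boundaries.
def peak_live_ops(L, B):
    peak = 0
    for k in range(0, L, B):
        live = min(B, L - k)    # operators materialized for this tile
        peak = max(peak, live)  # all of them are freed at the boundary
    return peak

# Peak graph memory tracks B, not L:
print(peak_live_ops(1_000, 32), peak_live_ops(1_000_000, 32))  # 32 32
```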

5. Numerical Stability in Stiff ODE Regimes

Conventional parallel scan algorithms exhibit catastrophic underflow or overflow for very stiff ODEs (e.g., $A_t$ with eigenvalues of order $\exp(-1000)$) in finite-precision arithmetic. TOSE incorporates a "log-shifting stabilizer," normalizing intermediate products within each tile by subtracting the local maximum log-magnitude prior to exponentiation and restoring the shift post-scan. Analytical error analysis demonstrates that the relative error after $L$ steps is bounded by

$$\|\text{error}\| \leq C\, \epsilon_{\mathrm{mach}},$$

with $C$ governed by the spectral radius of $A$ ($\leq 1$ in typical scenarios). Empirical regression of error versus $L$ shows no slope ($p \gg 0.05$), confirming no error accumulation with increasing sequence length. This distinguishes TOSE from naive parallel prefix schemes, which fail under extreme stiffness for large $L$ (Wang et al., 28 Dec 2025).
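The underflow issue and the log-shifting remedy can be illustrated on scalar decay factors: a naive running product of terms like $\exp(-50)$ underflows to exactly zero in float64, while subtracting the local maximum log-magnitude per tile before exponentiating keeps every intermediate representable. A toy sketch with illustrative values:

```python
import math

# Stiff scalar decay factors: each step multiplies by exp(lg) with
# lg around -50, so the true product after 200 steps is exp(-11990).
logs = [-50.0 - 0.1 * t for t in range(200)]

# Naive running product underflows to exactly 0.0 in float64.
naive = 1.0
for lg in logs:
    naive *= math.exp(lg)

def tile_log_product(tile):
    """Log of the product of exp(lg) over one tile, with log-shifting:
    subtract the tile's max log-magnitude before exponentiating so
    every normalized factor lies in (0, 1], then restore the shift."""
    m = max(tile)
    p = 1.0
    for lg in tile:
        p *= math.exp(lg - m)  # safe: no underflow within a small tile
    return len(tile) * m + math.log(p)

# Accumulate the shift-stabilized result tile by tile (tile size 8).
total_log = sum(tile_log_product(logs[i:i + 8]) for i in range(0, 200, 8))

print(naive)      # 0.0: catastrophic underflow
print(total_log)  # approximately -11990.0, the exact log-product
```

The real stabilizer operates on the augmented operator blocks inside the scan rather than on scalars, but the mechanism is the same: work with normalized magnitudes and carry the accumulated shift separately.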

6. Empirical Performance and Scalability

6.1 Memory Utilization

Standard Autograd backpropagation incurs $O(L \cdot n \cdot d)$ activation storage; at $L = 100{,}000$, a single layer exceeds 10 GB of VRAM. TOSE maintains graph memory at a constant level irrespective of $L$. Across $L \in [10^3, 10^5]$, PGF achieves a 94% reduction in peak VRAM usage on NVIDIA RTX 5090 hardware.

6.2 Throughput

On an RTX 5060 Laptop GPU, PGF demonstrates an $11.9\times$ throughput improvement over reverse-mode Autograd at $L = 10{,}000$, and sustains $23\times$ higher throughput for $L \approx 128{,}000$ ("chromosome scale"). This efficiency was evaluated via samples per second (see Figures 1 and 3 in (Wang et al., 28 Dec 2025)).

6.3 Sensitivity Analysis Benchmarks

In a 128,000-step impulse-response ("Ghost Pulse") benchmark, TOSE precisely recovers micro-perturbations at late timesteps ($t = 100{,}000$) without numerical leakage, a regime where Autograd encounters out-of-memory failure in multi-layered models.

7. Applications and Significance

TOSE enables feasible genome-scale sensitivity analysis ($L > 10^5$) on single 8–12 GB GPUs. This bridges the practical gap between infinite-context theoretical models and existing hardware constraints. The protocol is applicable to any linear SSM and directly supports models with diagonal or block-diagonal system matrices, including Mamba and S4 variants.

A plausible implication is that TOSE could generalize to other operator evolution domains beyond SSMs, provided they admit associative and scan-friendly forms. The minimal error accumulation and strict memory guarantees position TOSE as a foundational tool for large-scale differentiable modeling, particularly in genomics, signal processing, and other sequence analysis contexts where computational memory was previously prohibitive (Wang et al., 28 Dec 2025).
