TOSE: Tiled Operator-Space Evolution in SSMs

Updated 4 January 2026
  • TOSE is a core algorithmic primitive that achieves exact analytical differentiation in linear state-space models with strict O(1) memory usage via tiling.
  • It reframes gradient computation as forward evolution of an augmented dynamical system, detaching intermediate histories to eliminate memory scaling with sequence length.
  • Empirical results show TOSE reduces peak VRAM usage by up to 94% and improves throughput by up to 23×, making it well suited to large-scale applications such as genomics.

Tiled Operator-Space Evolution (TOSE) is a core algorithmic primitive for O(1)-memory exact analytical differentiation through linear state-space models (SSMs), enabling efficient sensitivity analysis for large-scale sequence modeling tasks. TOSE reframes the gradient computation as the forward evolution of an augmented dynamical system, exploiting the associative structure of operator evolution and leveraging a tiling and scan-based protocol to eliminate the memory overhead inherent in conventional backpropagation. TOSE supports applications such as chromosome-scale modeling in genomics, where conventional Autograd approaches encounter prohibitive memory costs.

1. Theoretical Basis and Role in Phase Gradient Flow

TOSE underpins the Phase Gradient Flow (PGF) framework, which addresses the memory bottleneck in gradient-based sensitivity analysis for large SSMs. PGF interprets backpropagation through a linear recurrence as the forward evolution of a "tangent" dynamical system. For a discrete-time SSM defined by

$$x_{t+1} = A x_t + B u_t, \qquad y_t = C x_t + D u_t,$$

the Fréchet derivative product $\nabla y_t = D\mathcal{F}[u] \cdot \nabla u_t$ evolves under a linear recurrence isomorphic to the original dynamics.

TOSE leverages this dynamical isomorphism by tiling the sequence into fixed-size blocks. Within each tile, both the primal and tangent systems are evolved via a parallel scan over associative operator products. The computational graph is detached at tile boundaries, resulting in memory usage that does not scale with sequence length $L$. This enables strictly O(1) differentiation-graph memory in $L$ (Wang et al., 28 Dec 2025).

2. Mathematical Foundation

2.1 Primal and Tangent Recursions

Given $x_t \in \mathbb{R}^n$, $u_t, y_t \in \mathbb{R}^d$, and allowing for diagonal or block-diagonal, possibly time-varying system matrices (as in Mamba/S4), the tangent dynamics for sensitivity analysis (the variation $\nabla x_t := \partial x_t / \partial u \cdot \nabla u$) satisfy

$$\nabla x_{t+1} = A \nabla x_t + (\partial A / \partial u_t \cdot \nabla u_t)\, x_t + (\partial B / \partial u_t \cdot \nabla u_t)\, u_t + B \nabla u_t,$$

$$\nabla y_t = C \nabla x_t + D \nabla u_t,$$

where, in the diagonal SSM setting, all derivatives and per-step operators remain element-wise linear.
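As a concrete check of the tangent recursion, the following minimal pure-Python sketch evolves the primal and tangent systems for a scalar (single-coordinate) SSM with constant coefficients, so that $\partial A / \partial u_t = \partial B / \partial u_t = 0$, and compares the analytic sensitivity against a finite-difference probe. All numeric values are illustrative toys, not taken from the paper:

```python
import math

# Scalar SSM with constant coefficients, so dA/du = dB/du = 0 and the
# tangent recurrence reduces to grad_x[t+1] = A*grad_x[t] + B*grad_u[t].
# All numeric values are illustrative toys, not from the paper.
A, B, C, D = 0.9, 0.5, 1.2, 0.1

def run(u):
    """Primal recursion: y_t = C x_t + D u_t, then x_{t+1} = A x_t + B u_t."""
    x, ys = 0.0, []
    for ut in u:
        ys.append(C * x + D * ut)
        x = A * x + B * ut
    return ys

def run_tangent(du):
    """Tangent recursion, structurally identical to the primal one."""
    gx, gys = 0.0, []
    for dut in du:
        gys.append(C * gx + D * dut)
        gx = A * gx + B * dut
    return gys

u = [math.sin(0.3 * t) for t in range(50)]
du = [1.0 if t == 10 else 0.0 for t in range(50)]  # perturb u_10 only

analytic = run_tangent(du)

# Finite-difference probe of the same directional derivative.
eps = 1e-6
u_plus = [ut + eps * dut for ut, dut in zip(u, du)]
fd = [(yp - y) / eps for yp, y in zip(run(u_plus), run(u))]

err = max(abs(a - f) for a, f in zip(analytic, fd))
print(err)  # tiny: the analytic tangent matches finite differences
```

Because the toy system is linear, the finite-difference probe recovers the directional derivative up to floating-point rounding, which is exactly what the tangent recursion computes without any stored backward history.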

2.2 Augmented Operator-Space

The triplet $(x_t, \nabla x_t, 1)$ is encoded as a single augmented vector, and an augmented block-matrix recurrence is defined:

$$\begin{bmatrix} x_t \\ \nabla x_t \\ 1 \end{bmatrix} = M_t \begin{bmatrix} x_{t-1} \\ \nabla x_{t-1} \\ 1 \end{bmatrix},$$

where each $M_t$ is

$$M_t = \begin{bmatrix} A & 0 & B u_t \\ K_t & A & j_t \\ 0 & 0 & 1 \end{bmatrix},$$

with $K_t = \partial A / \partial u_t \cdot \nabla u_t$ and $j_t = (\partial B / \partial u_t \cdot \nabla u_t)\, u_t + B \nabla u_t$.

The associativity $M_{t+1} \cdot M_t = M_{t:t+1}$ allows these operator products to be computed with a parallel prefix scan. The diagonal structure reduces the computation, for each coordinate, to products of compact $3 \times 3$ blocks.
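The associative composition of the augmented operators can be verified per coordinate with plain $3 \times 3$ matrices. This toy sketch uses constant $A, B$ (hence $K_t = 0$ and $j_t = B\,\nabla u_t$) and checks that applying the composed product $M_2 M_1$ equals stepping the augmented vector twice; the numbers are illustrative:

```python
# Per-coordinate check of operator associativity for the augmented
# 3x3 blocks. Constant A, B (hence K_t = 0, j_t = B * du_t); all
# numeric values are illustrative toys.
A, B = 0.8, 0.4

def mat(u_t, du_t):
    return [[A, 0.0, B * u_t],
            [0.0, A, B * du_t],
            [0.0, 0.0, 1.0]]

def matmul(P, Q):
    return [[sum(P[i][k] * Q[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def matvec(P, v):
    return [sum(P[i][k] * v[k] for k in range(3)) for i in range(3)]

M1, M2 = mat(1.0, 0.2), mat(-0.5, 0.7)
v0 = [0.3, 0.0, 1.0]  # augmented vector (x_0, grad_x_0, 1)

stepwise = matvec(M2, matvec(M1, v0))  # evolve one step at a time
fused = matvec(matmul(M2, M1), v0)     # apply the composed operator once

print(all(abs(a - b) < 1e-12 for a, b in zip(stepwise, fused)))
```

It is this equality, holding for arbitrary groupings of adjacent operators, that licenses computing the products in any tree order inside a prefix scan.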

3. Tiling and Parallel Scan Protocol

Sequence length $L$ is partitioned into tiles of size $B$. For each tile $k$ spanning steps $[(k-1)B + 1,\; kB]$, the per-step operator $M_t$ is constructed and a parallel scan (prefix product) is performed over the block. Only the state at the tile boundary is retained for subsequent computation, and intermediate autograd history within the tile is detached, yielding memory complexity independent of $L$.

The TOSE protocol is formalized as follows:

  • Step a) Load block inputs: $u_{\text{block}},\; \nabla u_{\text{block}}$ (memory effect: no graph growth)
  • Step b) Build per-step operators: construct $M_t$ for the tile (memory effect: per-step, local)
  • Step c) Parallel scan over $M_t$: compute $[x_{\text{out}}, \nabla x_{\text{out}}]$ (memory effect: intra-tile only)
  • Step d) Output sensitivity: $\nabla y_{\text{block}} = C \nabla x_{\text{out}} + D \nabla u_{\text{block}}$ (memory effect: store external output)
  • Step e) Boundary state detach: $x_{\text{in}}, \nabla x_{\text{in}}$ updated via .detach() (memory effect: limits recursive history)

TOSE relies on any associative parallel prefix-product implementation (e.g., a tree-based scan) to combine per-step operators within each tile.
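The per-tile protocol can be sketched in pure Python for one coordinate of a diagonal SSM, with a tree-structured prefix product standing in for the GPU parallel scan and a plain state overwrite standing in for `.detach()`. Constant $A, B$ (so $K_t = 0$) and all values are illustrative:

```python
# Toy per-tile evolution for one coordinate of a diagonal SSM with
# constant A, B (so K_t = 0 and j_t = B * du_t). The tree reduction
# below stands in for the parallel scan, and overwriting `state`
# stands in for the .detach() at each tile boundary.
A, B = 0.9, 0.3

def mat(u_t, du_t):
    # Augmented per-step operator M_t acting on (x, grad_x, 1).
    return [[A, 0.0, B * u_t],
            [0.0, A, B * du_t],
            [0.0, 0.0, 1.0]]

def matmul(P, Q):
    return [[sum(P[i][k] * Q[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def matvec(P, v):
    return [sum(P[i][k] * v[k] for k in range(3)) for i in range(3)]

def tile_product(ops):
    """Tree-structured prefix product: combines adjacent operator
    pairs level by level (O(log B) depth, as in a parallel scan)."""
    while len(ops) > 1:
        nxt = [matmul(ops[i + 1], ops[i]) for i in range(0, len(ops) - 1, 2)]
        if len(ops) % 2:
            nxt.append(ops[-1])
        ops = nxt
    return ops[0]

L, Btile = 16, 4
u = [0.1 * t for t in range(L)]
du = [1.0] * L
state = [0.0, 0.0, 1.0]  # boundary state (x_in, grad_x_in, 1)

for k in range(0, L, Btile):
    ops = [mat(u[t], du[t]) for t in range(k, k + Btile)]
    state = matvec(tile_product(ops), state)  # keep boundary state only

# Sequential reference over the full sequence agrees with the tiled scan.
ref = [0.0, 0.0, 1.0]
for t in range(L):
    ref = matvec(mat(u[t], du[t]), ref)

print(max(abs(a - b) for a, b in zip(state, ref)))
```

Note that after each tile only the three boundary numbers survive; the per-step operators inside the tile are discarded, which is the source of the constant graph memory.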

4. Memory Complexity and Differentiation Graph Architecture

By detaching computational history at tile boundaries, the memory footprint for automatic differentiation is proportional to $B \cdot (\dim x + \dim \nabla x)$, a strict constant in $L$ as $L \rightarrow \infty$. All input and output tensors (e.g., $u,\; \nabla u,\; y,\; \nabla y$) retain linear scaling in $L$, but these are not part of the recursive differentiation graph. TOSE thus guarantees strictly O(1) memory scaling for the differentiation graph with respect to sequence length.
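A toy accounting of this claim: under tiling, only the $B$ per-step operators of the current tile are ever simultaneously live in the graph, regardless of $L$. The helper below is a hypothetical illustration of that bookkeeping, not part of the TOSE implementation:

```python
# Hypothetical accounting helper (illustrative, not from the paper):
# counts how many per-step operators are ever simultaneously live in
# the differentiation graph when history is detached at tile boundaries.
def peak_live_ops(L, B):
    peak = 0
    for k in range(0, L, B):
        live = min(B, L - k)    # operators materialized for this tile
        peak = max(peak, live)  # all of them are freed at the boundary
    return peak

# Peak graph memory tracks B, not L:
print(peak_live_ops(1_000, 32), peak_live_ops(1_000_000, 32))  # 32 32
```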

5. Numerical Stability in Stiff ODE Regimes

Conventional parallel scan algorithms exhibit catastrophic underflow or overflow for very stiff ODEs (e.g., $A_t$ with eigenvalues of order $\exp(-1000)$) in finite-precision arithmetic. TOSE incorporates a "log-shifting stabilizer," normalizing intermediate products within each tile by subtracting the local maximum log-magnitude prior to exponentiation and restoring the shift post-scan. Analytical error analysis demonstrates that the relative error after $L$ steps is bounded by

$$\|\text{error}\| \leq C\, \epsilon_{\mathrm{mach}},$$

with $C$ governed by the spectral radius of $A$ ($\leq 1$ in typical scenarios). Empirical regression of error versus $L$ shows no slope ($p \gg 0.05$), confirming no error accumulation with increasing sequence length. This distinguishes TOSE from naive parallel prefix schemes, which fail under extreme stiffness for large $L$ (Wang et al., 28 Dec 2025).
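The underflow issue and the log-shifting remedy can be illustrated on scalar decay factors: a naive running product of terms like $\exp(-50)$ underflows to exactly zero in float64, while subtracting the local maximum log-magnitude per tile before exponentiating keeps every intermediate representable. A toy sketch with illustrative values:

```python
import math

# Stiff scalar decay factors: each step multiplies by exp(lg) with
# lg around -50, so the true product after 200 steps is exp(-11990).
logs = [-50.0 - 0.1 * t for t in range(200)]

# Naive running product underflows to exactly 0.0 in float64.
naive = 1.0
for lg in logs:
    naive *= math.exp(lg)

def tile_log_product(tile):
    """Log of the product of exp(lg) over one tile, with log-shifting:
    subtract the tile's max log-magnitude before exponentiating so
    every normalized factor lies in (0, 1], then restore the shift."""
    m = max(tile)
    p = 1.0
    for lg in tile:
        p *= math.exp(lg - m)  # safe: no underflow within a small tile
    return len(tile) * m + math.log(p)

# Accumulate the shift-stabilized result tile by tile (tile size 8).
total_log = sum(tile_log_product(logs[i:i + 8]) for i in range(0, 200, 8))

print(naive)      # 0.0: catastrophic underflow
print(total_log)  # approximately -11990.0, the exact log-product
```

The real stabilizer operates on the augmented operator blocks inside the scan rather than on scalars, but the mechanism is the same: work with normalized magnitudes and carry the accumulated shift separately.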

6. Empirical Performance and Scalability

6.1 Memory Utilization

Standard Autograd backpropagation incurs $O(L \cdot n \cdot d)$ activation storage; at $L = 100{,}000$, a single layer exceeds 10 GB of VRAM. TOSE maintains graph memory at a constant level irrespective of $L$. Across $L \in [10^3, 10^5]$, PGF achieves a 94% reduction in peak VRAM usage on NVIDIA RTX 5090 hardware.

6.2 Throughput

On an RTX 5060 Laptop GPU, PGF demonstrates an $11.9\times$ throughput improvement over reverse-mode Autograd at $L = 10{,}000$, and sustains $23\times$ higher throughput for $L \approx 128{,}000$ ("chromosome scale"). This efficiency was evaluated via samples per second (see Figures 1 and 3 in (Wang et al., 28 Dec 2025)).

6.3 Sensitivity Analysis Benchmarks

In a 128,000-step impulse-response ("Ghost Pulse") benchmark, TOSE precisely recovers micro-perturbations at late timesteps ($t = 100{,}000$) without numerical leakage, a regime where Autograd encounters out-of-memory failure in multi-layered models.

7. Applications and Significance

TOSE enables feasible genome-scale sensitivity analysis ($L > 10^5$) on single 8–12 GB GPUs. This bridges the practical gap between infinite-context theoretical models and existing hardware constraints. The protocol is applicable to any linear SSM and directly supports models with diagonal or block-diagonal system matrices, including Mamba and S4 variants.

A plausible implication is that TOSE could generalize to other operator evolution domains beyond SSMs, provided they admit associative and scan-friendly forms. The minimal error accumulation and strict memory guarantees position TOSE as a foundational tool for large-scale differentiable modeling, particularly in genomics, signal processing, and other sequence analysis contexts where computational memory was previously prohibitive (Wang et al., 28 Dec 2025).
