Streaming Tensor Program (STeP)

Updated 18 November 2025

STeP is a formal abstraction that represents high-dimensional tensors as streams, enabling dynamic, data-dependent computations on spatial dataflow hardware.
It introduces specialized operators (Partition, Reassemble, EagerMerge) to handle irregular tensor shapes and optimize memory usage and parallel execution.
Empirical results demonstrate significant gains in on-chip memory reduction and latency improvements, driven by dynamic tiling, load balancing, and compiler-driven fusion.

A Streaming Tensor Program (STeP) is a formal abstraction for representing, analyzing, and optimizing programs that process high-dimensional tensor data in a streaming or highly dynamic fashion. STeP unifies principled stream-typed representations for dynamic tensor workloads, advanced memory- and dataflow-aware operators, and end-to-end compiler/runtime pipelines for efficient execution on spatial dataflow hardware. STeP techniques are central in fields ranging from large-scale scientific data compression and LLM hardware acceleration to scalable decomposition algorithms for streaming, distributed, or massive tensor data.

1. Core Abstraction and Formal Semantics

STeP represents all tensor data as streams: sequences of rank‑ $a$ tensor elements $(T)$ , where the stream shape $\mathrm{shape}_\mathrm{stream} = [D_a, D_{a-1}, \dots, D_0]$ may include static-regular, dynamic-regular, or ragged (absorbing) dimensions. Stop tokens $S_N$ delineate logical tensor boundaries; e.g., $S_1$ for vectors, $S_2$ for matrices. Crucially, stream types are parameterized by symbolic shape semantics: the length and structure of tensor dimensions are represented as symbolic expressions, enabling propagation of dynamic shape information through program graphs.

Elements of the stream can be fixed or runtime-sized tiles (e.g., shape $[t_m, t_n]$ ), selectors (multi-hot vectors) for dynamic routing, or pointers to on-chip buffers. The STeP type system explicitly encodes the memory hierarchy (off-chip DRAM, on-chip banks, local PE scratchpads) and annotates data movement, allowing precise symbolic tracking of memory footprints and data rates. Each operator (load/store, bufferize, compute, etc.) emits symbolic footprints, e.g.,

$M_\mathrm{onchip} = |\mathrm{dtype}| + 2 \cdot \|\mathrm{buf}\| \cdot |\mathrm{dtype}| \quad \text{(double-buffered)}$

A single STeP graph thus describes program semantics, dataflow, and performance bounds across all possible runtime tensor shapes and rates (Sohn et al., 11 Nov 2025).

2. High-Dimensional Streaming and Dynamic Parallelism

The abstraction of streaming tensor programs is especially powerful for workloads exhibiting runtime-dynamic behavior, such as variable batch sizes, ragged tensors, or data-dependent control flow. STeP introduces flexible routing and merging operators:

Partition $_{a,b}$ : routes chunks of $b$ inner dims of an input rank‑ $a$ stream to $N$ outputs, determined by a multi-hot selector. The count of routed chunks for each output is exposed as a dynamic dimension.
Reassemble $_{a,b}$ : concatenates inner $b$ dims from input streams into a higher-rank stream, supporting ragged or irregular concatenation.
EagerMerge $_a$ : greedily dequeues data from any ready input stream $i$ , tagging each chunk with a selector index, enabling true dynamic load balancing.

These operators enable efficient expression of complex, data-dependent parallelization patterns not possible in static dataflow representations. For example, in autoregressive Grouped Query Attention (GQA), varying KV-cache lengths across batch elements would cause severe load imbalance in static batch-parallel execution, but can be handled optimally in STeP by integrating Partition, EagerMerge, and feedback subgraphs (Sohn et al., 11 Nov 2025).

3. Compiler Realizations and Type-Driven Fusion

Compiler frameworks like StreamTensor (Ye et al., 17 Sep 2025) realize STeP abstractions via an explicit iterative tensor type system. Each stream-typed tensor is annotated:

Element shape $D$ ,
Iteration domain $I$ ,
Affine iteration–to–data mapping $\varphi$ .

Dataflow kernels are fused, bufferized, or vectorized according to stream type compatibility. Where types differ, minimal converters are automatically synthesized, minimizing ping-pong buffer sizes and reconciling layout differences by analytical comparison of shape and mapping prefixes.

The multi-level compiler pipeline operates as follows:

PyTorch model → MLIR Linalg IR → tiled, stream-typed dataflow IR → kernel fusion (type-directed) → buffer converter/DMA insertion → resource allocation (FIFO, buffer sizing via token/LP models) → HLS/FPGA backend.
Hierarchical design-space exploration optimizes tile shapes, loop unrolling, kernel fusion groups, and memory allocation—guided by symbolic stream types and scheduling constraints (Ye et al., 17 Sep 2025).

Automation of each stage is driven by explicit stream typing and cost models, avoiding brittle heuristic buffer sizing and system-level deadlocks.

4. Optimizations: Dynamic Tiling, Parallelization, and Time-Multiplexing

Exposing symbolic shape and memory semantics enables optimizations otherwise inexpressible in static compiler models:

Dynamic Tiling: In MoE workloads, tokens-per-expert may be highly uneven. STeP accumulates all tokens for an expert into a single dynamic tile, minimizing padding and memory, and ensuring each expert’s weights need to be loaded only once rather than per static tile. Empirically, this reduced on-chip memory requirements by $2.18\times$ and improved latency by $1.2\times$ on LLM layers compared to the best static tiling (Sohn et al., 11 Nov 2025).
Dynamic Parallelization: In autoregressive decoding, requests with variable KV-cache lengths are dynamically load-balanced across PEs using EagerMerge feedback loops, achieving $1.5\times$ latency speedup vs. static parallelization (Sohn et al., 11 Nov 2025).
Configuration Time-Multiplexing: For large MoEs ( $E \gg M$ experts), STeP dynamically funnels tokens from all experts into a small set of $M$ pipelines, dynamically loading weights as needed. This increased compute utilization by $2.57\times$ vs. static allocation, with negligible latency overhead (Sohn et al., 11 Nov 2025).

These optimizations are consistently validated across LLM stacks (Qwen3-30B-A3B, Mixtral8x7B, etc.) and are realized in simulation and on FPGA prototypes (Sohn et al., 11 Nov 2025, Ye et al., 17 Sep 2025).

5. Dynamic Tensor Factorization in Streaming Contexts

STeP also refers more broadly to streaming algorithms for multi-way tensor decomposition. These are critical for scientific data compression, statistical modeling, and latent structure discovery at streaming scale.

Streaming Tucker Programs: Algorithms such as the streaming Tucker decomposition (De et al., 2023) initialize on an initial data batch, then incrementally update factor matrices and the core tensor per slice, achieving memory usage nearly independent of the number of time steps and reducing per-slice complexity via ISVD updates. These approaches preserve Frobenius-norm approximation guarantees and significantly lower memory costs compared to batch algorithms, enabling in-situ scientific simulation compression.
Streaming CP and GCP Decompositions: Streaming Generalized Canonical Polyadic factorization (OnlineGCP) incrementally updates CP factors using segregated stochastic optimization with reservoir-sampled historical windows, tunable forgetting, and highly structured memory/resource use (Phipps et al., 2021).
Streaming Tensor Train: Algorithms such as STTA process high-dimensional tensors without full materialization, using two-sided random projection sketches to produce optimal TT decompositions with sharp error guarantees and effective one-pass performance (Kressner et al., 2022).
Streaming Coresets for Symmetric Tensor Factorization: These algorithms maintain sublinear-size coresets with online filtering and kernelization, providing $(1\pm\epsilon)$ guarantees for $p$ -moment tensor approximations and ensuring rigorous spectral approximation with provable bounds for model learning (Chhaya et al., 2020).

Table: Key STeP Algorithmic Variants

Variant	Structure	Memory Scaling
STeP (dataflow)	Symbolic streaming graph	Symbolic in $D_i$
Streaming Tucker	Incremental core/factor	$O(\sum I_n R_n + \prod R_n)$
Streaming TT (STTA)	Two-sided sketching	$O(d n r^2)$
Online GCP/CP	Stochastic segregated	$O(H R^2 + \sum I_k R)$
Streaming Coreset	Online leverage sampling	$O(d^2)$ (matrix)

6. Empirical Results and Implementation Trade-Offs

Empirical validation on realistic hardware simulators (cycle-approximate dataflow simulators, Bluespec+Ramulator2), FPGAs, and distributed frameworks demonstrates:

STeP dynamic tiling: $2.18\times$ less on-chip memory, $1.2\times$ latency gain (Sohn et al., 11 Nov 2025).
Dynamic parallelization: $1.5\times$ speedup over statically scheduled dataflow (Sohn et al., 11 Nov 2025).
Multiplexed MoE execution: $2.57\times$ utilization boost for $M=32$ vs. $E=128$ experts (Sohn et al., 11 Nov 2025).
StreamTensor on LLMs: up to $0.76\times$ latency, $1.99\times$ tokens/J advantage over GPU baselines with prioritized FIFO sizing and hierarchical IR-leveraged kernel fusion (Ye et al., 17 Sep 2025).
Streaming Tucker: memory use reduced from $600$ MB to $20$ MB for synthetic ( $100\times100\times5000$ ) and $39$ GB to $22$ GB for HCCI ( $672\times672\times33\times268$ ) with comparable approximation quality (De et al., 2023).
Streaming TT: empirical errors within factor of TT-SVD, optimal for structured or mixed-form tensors; scaling in $d$ matches $\sqrt{d}$ behavior of classical TT-SVD (Kressner et al., 2022).
Online GCP: scalable to tens of thousands of slices on multi-core/GPU, maintaining local/global loss parity with batch methods even for real, sparse, and non-Gaussian tensor streams (Phipps et al., 2021).

Limitations include symbolic shape models providing only upper/lower bounds in extremal cases with highly ragged data, potential overhead from dynamic memory mapping, and added control logic in hardware (e.g., FIFO controller modifications for stop tokens). For coreset-based STeP, sensitivity estimation and kernelization may incur higher per-update cost for high $p$ , but yield rigorous core-approximation guarantees (Chhaya et al., 2020).

7. Future Directions and Extensions

Promising extensions include:

Automated compiler passes for optimal dynamic tiling and parallelization.
SDA ASIC/FPGA co-design to directly support STeP stream and routing primitives.
Expansion to non-ML domains, such as streaming graph analytics or signal processing.
Integration with shape inference, just-in-time hardware reconfiguration, and fine-grained memory management for fully adaptive streaming tensor computation.
Deeper hierarchical IRs and type systems for unified symbolic performance analysis, efficient bufferization, and end-to-end automated deployment from high-level frameworks to hardware (Sohn et al., 11 Nov 2025, Ye et al., 17 Sep 2025).

By making shape, data rates, and the explicit memory hierarchy a part of the program semantics, STeP enables dynamic workloads to be handled with performance and rigor previously achievable only in static and regular dataflow, bridging the gap between real-world dynamic tensor computation and hardware-efficient execution.

References

"Streaming Tensor Program: A streaming abstraction for dynamic parallelism" (Sohn et al., 11 Nov 2025)
"StreamTensor: Make Tensors Stream in Dataflow Accelerators for LLMs" (Ye et al., 17 Sep 2025)
"Efficient Computation of Tucker Decomposition for Streaming Scientific Data Compression" (De et al., 2023)
"Streaming Tensor Train Approximation" (Kressner et al., 2022)
"Streaming Generalized Canonical Polyadic Tensor Decompositions" (Phipps et al., 2021)
"Streaming Coresets for Symmetric Tensor Factorization" (Chhaya et al., 2020)