Weight Stationary Dataflow in DNN Accelerators

Updated 7 March 2026

Weight Stationary Dataflow is a technique that pins filter weights locally in processing elements to maximize reuse in convolution and matrix multiplication operations.
It minimizes DRAM bandwidth usage by loading weights only once and reusing them across multiple MAC operations, ideal for FPGAs, ASICs, and 3D-integrated systems.
Designs using WS dataflow leverage multi-level tiling and static scheduling to optimize memory traffic and computation balance for efficient DNN inference.

Weight-Stationary (WS) dataflow is a canonical architectural strategy for mapping convolutional and matrix multiplication workloads onto spatial accelerators. In WS dataflow, a subset or all of the filter weights are "pinned" locally within each processing element (PE)—implemented as registers, scratchpads, or local SRAM—such that each weight is fetched from off-chip memory only once and then reused for as many multiply-accumulate (MAC) operations as possible. Activations are streamed into the PE array, and partial sums are accumulated as the computation proceeds. This scheme is widely employed in FPGAs, ASICs (e.g., TPUs), and advanced 3D-integrated arrays, especially for deep neural network (DNN) inference where minimizing DRAM bandwidth and maximizing local weight reuse are critical to energy and throughput efficiency (Li, 13 May 2025, Shukla et al., 2024, Yin et al., 25 Feb 2025, Zhou et al., 2023, Elbtity et al., 2024).

1. Formal Definition and Computational Model

The WS dataflow is formally defined for a convolutional layer with:

$B$ : Batch size
$C$ : Input channels
$K$ : Output channels
$H,W$ : Input feature map height, width
$R,S$ : Filter height, width
$H',W'$ : Output height, width ($H' = H - R + 1,\; W' = W - S + 1$, i.e., unit stride and no padding)

The convolution operation is:

$Y[k,y,x,b] = \sum_{c=0}^{C-1} \sum_{r=0}^{R-1} \sum_{s=0}^{S-1} W[k,c,r,s] \cdot X[c,y+r,x+s,b]$

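In WS, each $PE_i$ is assigned a tile of the weight tensor $W[k,c,r,s]$ (often partitioned along $k$), which it stores locally for the entire computation of one or more output channels. All relevant activations $X$ and running partial sums $PS$ stream through the network of PEs. Here $W[\cdot]$ denotes the weight tensor, distinct from the scalar feature-map width $W$.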
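As a minimal sketch of what "stationary" means in loop-order terms, the pure-Python code below evaluates the convolution twice: once output-major, straight from the defining sum, and once with the weight loops outermost so that each weight is read a single time and then applied to every output it contributes to. The function names, the nested-list tensor layout, and the fetch counter are illustrative assumptions, not taken from any cited design.

```python
import itertools
import random

def make_tensor(shape, rng):
    """Nested lists of small random integers with the given shape."""
    if len(shape) == 1:
        return [rng.randint(-3, 3) for _ in range(shape[0])]
    return [make_tensor(shape[1:], rng) for _ in range(shape[0])]

def zeros_out(K, Ho, Wo, B):
    return [[[[0] * B for _ in range(Wo)] for _ in range(Ho)] for _ in range(K)]

def conv_direct(X, Wt, B, C, K, H, W, R, S):
    """Output-major evaluation of Y[k,y,x,b] (unit stride, no padding)."""
    Ho, Wo = H - R + 1, W - S + 1
    Y = zeros_out(K, Ho, Wo, B)
    for k, y, x, b in itertools.product(range(K), range(Ho), range(Wo), range(B)):
        Y[k][y][x][b] = sum(Wt[k][c][r][s] * X[c][y + r][x + s][b]
                            for c in range(C) for r in range(R) for s in range(S))
    return Y

def conv_weight_stationary(X, Wt, B, C, K, H, W, R, S):
    """Weight-major evaluation: every weight is fetched exactly once."""
    Ho, Wo = H - R + 1, W - S + 1
    Y = zeros_out(K, Ho, Wo, B)
    weight_fetches = 0
    for k, c, r, s in itertools.product(range(K), range(C), range(R), range(S)):
        w = Wt[k][c][r][s]          # the single fetch; w then stays "pinned"
        weight_fetches += 1
        for y, x, b in itertools.product(range(Ho), range(Wo), range(B)):
            Y[k][y][x][b] += w * X[c][y + r][x + s][b]   # H'*W'*B reuses of w
    return Y, weight_fetches

rng = random.Random(0)
B, C, K, H, W, R, S = 2, 3, 4, 5, 6, 3, 2
X = make_tensor([C, H, W, B], rng)      # layout X[c][y][x][b]
Wt = make_tensor([K, C, R, S], rng)     # layout W[k][c][r][s]
Y_ref = conv_direct(X, Wt, B, C, K, H, W, R, S)
Y_ws, fetches = conv_weight_stationary(X, Wt, B, C, K, H, W, R, S)
print(Y_ws == Y_ref, fetches)
```

The price of the weight-major order is visible in the inner loop: every partial sum in `Y` is revisited once per weight, which is the activation and partial-sum traffic a WS design has to absorb on chip.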

Total weights: $|W| = C \cdot K \cdot R \cdot S$
Total MACs: $|W| \cdot H' \cdot W' \cdot B$
Weight reuse factor: $\mathrm{Reuse}_W = H' \cdot W' \cdot B$
Weight off-chip traffic: $|W| \cdot \mathrm{sizeof(data\_type)}$ (ideally one fetch per weight)
Activation and partial sum traffic: Typically higher, determined by tiling and PE sharing pattern
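The quantities above are simple products, so a worked example may help fix the magnitudes. The sketch below evaluates them for one layer; the `ConvLayer` class, the `ws_metrics` helper, and the chosen layer shape are illustrative assumptions, and the output size assumes unit stride and no padding.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ConvLayer:
    B: int  # batch size
    C: int  # input channels
    K: int  # output channels
    H: int  # input height
    W: int  # input width
    R: int  # filter height
    S: int  # filter width

    @property
    def out_h(self) -> int:
        return self.H - self.R + 1

    @property
    def out_w(self) -> int:
        return self.W - self.S + 1

def ws_metrics(layer: ConvLayer, bytes_per_weight: int) -> dict:
    """Evaluate the weight count, MAC count, reuse factor and ideal weight traffic."""
    num_weights = layer.C * layer.K * layer.R * layer.S
    reuse = layer.out_h * layer.out_w * layer.B
    return {
        "num_weights": num_weights,
        "macs": num_weights * reuse,
        "weight_reuse": reuse,
        # Ideal case: each weight crosses the off-chip interface exactly once.
        "weight_bytes": num_weights * bytes_per_weight,
    }

# Hypothetical 3x3 layer with 64 input and 64 output channels, INT8 weights.
layer = ConvLayer(B=1, C=64, K=64, H=58, W=58, R=3, S=3)
metrics = ws_metrics(layer, bytes_per_weight=1)
print(metrics)
```

For this shape each of the 36,864 weights is reused 3,136 times, so about 36 KB of weight traffic supports roughly 116 million MACs.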

Hardware models such as MAESTRO and Timeloop are used to quantify utilization, bandwidth, and the balance of memory-vs-compute walls, with the core metric for WS being maximized local reuse of the filter weights (Li, 13 May 2025).

2. Architecture and Dataflow Patterns

2.1 FPGA and Systolic Arrays

In FPGA and systolic array implementations, WS is realized by statically loading each PE with its assigned weight tile. A multi-level buffering hierarchy exploits on-chip memories—including registers, LUTRAM, BRAM, and, for large-scale designs, HBM—for both weights and activations. The deepest level (registers, LUTRAM) is used to pin weights for maximum reuse, while activations and partial sums are double-buffered or streamed (Li, 13 May 2025, Shukla et al., 2024).

A typical mapping: the 2D PE array is organized such that each row or column of PEs holds the weight parameters for a single output or input tile. Activations are broadcast horizontally, while partial sums flow vertically. In monolithic 3D architectures (e.g., WS-MONO3D), resistive RAM (RRAM) and SRAM are stacked with the logic tier using dense vertical vias, enabling simultaneous weight load and input multicast, further minimizing access latency and power (Shukla et al., 2024).
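The row/column mapping can be illustrated with a functional (not cycle-accurate) model. In the sketch below, PE(row c, column k) holds one pinned weight, the activation for row c is broadcast to all PEs in that row, and each column's partial sum is passed downward and emitted at the bottom. The function name and the choice of rows as input channels and columns as output channels are assumptions made for illustration; real arrays add input skew and pipeline registers that this model ignores.

```python
def ws_array_matvec(weights, activations):
    """Functional model of a weight-stationary PE array.

    weights[c][k] is the value pinned in PE(row c, column k).
    activations[c] is broadcast horizontally across row c.
    Partial sums flow vertically, top to bottom, one per column.
    """
    rows, cols = len(weights), len(weights[0])
    if len(activations) != rows:
        raise ValueError("need one activation per PE row")
    psum = [0] * cols                      # zeros enter the top of each column
    for c in range(rows):
        a = activations[c]                 # one value shared by the whole row
        psum = [psum[k] + weights[c][k] * a for k in range(cols)]
    return psum                            # values leaving the bottom row

pinned = [[1, 2],
          [3, 4],
          [5, 6]]                          # 3 input channels x 2 output channels

# The same pinned weights serve every activation vector streamed through.
for vec in ([1, 0, -1], [1, 1, 1]):
    print(vec, "->", ws_array_matvec(pinned, vec))
```

Streaming many activation vectors through the same `pinned` array is where the reuse comes from: the weights are written into the PEs once, and only activations and partial sums move afterwards.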

2.2 SIMT and SIMD Microarchitectures

On GPU and CPU vector architectures, WS maps one or more SIMD lanes to each weight or weight vector, holding them stationary across many activation and output updates. PacQ (Yin et al., 25 Feb 2025) partitions weight matrices into packed INT tiles, storing them in per-octet buffers to maximize their stationary reuse. Streaming activations are processed against these stationary weights using specialized FP-INT multipliers designed to exploit packing for efficient parallel computation.

On general-purpose CPUs, as shown in "YFlows" (Zhou et al., 2023), the WS dataflow iterates outermost over weight vectors. Once loaded into a SIMD register, a weight vector is reused across multiple input and output accesses before being evicted.
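The effect of iterating outermost over weights can be seen in a scalar stand-in for the SIMD case. The sketch below runs a 1D convolution in weight-stationary and output-stationary loop orders and counts register loads under a deliberately simple, assumed cost model (one register each for the held operand, no caching of the others); the function names and the counting scheme are illustrative and are not YFlows' actual code generator.

```python
def conv1d_ws(x, w):
    """Weight-stationary order: each weight is loaded once, then swept over all outputs."""
    n_out = len(x) - len(w) + 1
    y = [0] * n_out
    loads = {"weight": 0, "input": 0, "output": 0}
    for s, ws in enumerate(w):
        loads["weight"] += 1               # weight enters a register once
        for i in range(n_out):
            loads["input"] += 1
            loads["output"] += 1           # read-modify-write of y[i]
            y[i] += ws * x[i + s]
    return y, loads

def conv1d_os(x, w):
    """Output-stationary order: each output accumulates in a register."""
    n_out = len(x) - len(w) + 1
    y = [0] * n_out
    loads = {"weight": 0, "input": 0, "output": 0}
    for i in range(n_out):
        acc = 0
        for s, ws in enumerate(w):
            loads["weight"] += 1           # weight re-fetched for every output
            loads["input"] += 1
            acc += ws * x[i + s]
        y[i] = acc
        loads["output"] += 1               # single write-back per output
    return y, loads

x = [1, 2, 3, 4, 5]
w = [1, 0, -1]
y_ws, loads_ws = conv1d_ws(x, w)
y_os, loads_os = conv1d_os(x, w)
print(y_ws, loads_ws)
print(y_os, loads_os)
```

Both orders produce the same result; the WS order trades weight loads for output accesses, which is one reason register pressure and partial-sum traffic decide whether WS is the better choice on a given CPU.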

3. Pseudocode, Tiling, and Scheduling

The WS strategy is implemented using multi-level tiling to fit weights and partial sums within fast local memory, while activations are partitioned for streaming. The following pseudocode illustrates a two-level tiling WS schedule for FPGA-based convolution (Li, 13 May 2025):

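For each output channel tile K_t:
    Load weight tile W_buf (all C, R, S weights of the K_t channels) into BRAM
    For each spatial tile (H′_t x W′_t):
        Initialize partial sum buffer PS_buf to zero
        For each input channel tile C_t:
            Load activation tile X_buf (including the R-1, S-1 halo)
            For each PE (output channel k) in K_t:
                For each local weight (c, r, s) with c in C_t:
                    For each (y', x') in the spatial tile:
                        PS_buf[k, y', x'] += W_buf[k, c, r, s] * X_buf[c, y'+r, x'+s]
        Write PS_buf back to output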

Key tactics include double-buffering for activation tiles, deep pipelining (target initiation interval II=1), and loop unrolling across PE columns/rows proportional to available DSPs/LUTs. In advanced systems, tiling extends to exploit distinct memory levels: HBM for largest tiles, BRAM for intermediate, LUTRAM/registers for innermost loops.
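The tiled schedule can be checked in software before committing it to hardware. The following pure-Python sketch mirrors the pseudocode for a single image (batch size 1) and compares the result against a direct evaluation; buffer names follow the pseudocode, while the function names, tile sizes, and the word counter are assumptions for illustration. It models neither double-buffering nor pipelining.

```python
import random

def conv_direct(x, wts, C, K, H, W, R, S):
    Ho, Wo = H - R + 1, W - S + 1
    return [[[sum(wts[k][c][r][s] * x[c][i + r][j + s]
                  for c in range(C) for r in range(R) for s in range(S))
              for j in range(Wo)] for i in range(Ho)] for k in range(K)]

def conv_ws_tiled(x, wts, C, K, H, W, R, S, k_tile, c_tile, h_tile, w_tile):
    Ho, Wo = H - R + 1, W - S + 1
    y = [[[0] * Wo for _ in range(Ho)] for _ in range(K)]
    weight_words_loaded = 0
    for k0 in range(0, K, k_tile):
        k1 = min(k0 + k_tile, K)
        w_buf = [wts[k] for k in range(k0, k1)]        # loaded once per K tile
        weight_words_loaded += (k1 - k0) * C * R * S
        for y0 in range(0, Ho, h_tile):
            y1 = min(y0 + h_tile, Ho)
            for x0 in range(0, Wo, w_tile):
                x1 = min(x0 + w_tile, Wo)
                ps_buf = [[[0] * (x1 - x0) for _ in range(y1 - y0)]
                          for _ in range(k1 - k0)]
                for c0 in range(0, C, c_tile):
                    c1 = min(c0 + c_tile, C)
                    # Activation tile including the (R-1, S-1) halo.
                    x_buf = [[row[x0:x1 + S - 1] for row in x[c][y0:y1 + R - 1]]
                             for c in range(c0, c1)]
                    for kk in range(k1 - k0):           # one "PE" per output channel
                        for cc in range(c1 - c0):
                            for r in range(R):
                                for s in range(S):
                                    w = w_buf[kk][c0 + cc][r][s]   # stationary weight
                                    for yy in range(y1 - y0):
                                        for xx in range(x1 - x0):
                                            ps_buf[kk][yy][xx] += w * x_buf[cc][yy + r][xx + s]
                for kk in range(k1 - k0):               # write PS_buf back
                    for yy in range(y1 - y0):
                        y[k0 + kk][y0 + yy][x0:x1] = ps_buf[kk][yy]
    return y, weight_words_loaded

rng = random.Random(1)
C, K, H, W, R, S = 3, 4, 6, 7, 3, 2
x = [[[rng.randint(-2, 2) for _ in range(W)] for _ in range(H)] for _ in range(C)]
wts = [[[[rng.randint(-2, 2) for _ in range(S)] for _ in range(R)]
        for _ in range(C)] for _ in range(K)]
y_tiled, words = conv_ws_tiled(x, wts, C, K, H, W, R, S,
                               k_tile=3, c_tile=2, h_tile=3, w_tile=2)
print(y_tiled == conv_direct(x, wts, C, K, H, W, R, S), words)
```

The tile sizes here deliberately do not divide the layer dimensions, so the `min(...)` edge handling is exercised. Each weight word is counted once, whereas the activation tile `x_buf` is rebuilt for every output-channel tile.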

4. Performance Modeling and Empirical Results

Performance and cost models for WS balance the benefits of minimized weight traffic against increased activation and partial sum movement:

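Compute time: $T_\textrm{compute} = \mathrm{MAC} / (\mathrm{PE_{count}}\cdot f_\mathrm{clk} \cdot \mathrm{Util})$, assuming one MAC per PE per cycle
External bandwidth per operand: e.g., $BW_W = \frac{|W| \cdot \mathrm{sizeof(data\_type)}}{T_\textrm{compute}}$
Total energy: $E_\textrm{total} = E_\textrm{MAC} \times \mathrm{MAC} + E_\textrm{BRAM} \times (\mathrm{reads}+\mathrm{writes}) + E_\textrm{DRAM} \times (\textrm{off-chip accesses})$
Energy-delay product (EDP): $\mathrm{EDP} = E_\textrm{total} \times T$, where $T$ is the inference latency ($T_\textrm{compute}$ when the design is compute-bound)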
WS-MONO3D improvements (relative to 2D systolic, iso-area):
- Up to 47% lower inference latency
- Up to 40% lower EDP
- Up to 10× improvement in inferences/s/watt/mm² due to vertical stacking (Shukla et al., 2024)
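The analytical model at the top of this list is easy to evaluate numerically. The sketch below treats the first expression as a time in seconds, the energy sum as total energy, and EDP as energy multiplied by delay; the function name, the per-access energies, and the access counts are hypothetical placeholders chosen for round numbers, not measurements from any cited work.

```python
def ws_cost_model(macs, pe_count, f_clk_hz, util, weight_bytes,
                  bram_accesses, dram_accesses, e_mac_j, e_bram_j, e_dram_j):
    """First-order WS cost model; assumes one MAC per PE per cycle at full utilization."""
    if not 0.0 < util <= 1.0:
        raise ValueError("util must be in (0, 1]")
    t_compute = macs / (pe_count * f_clk_hz * util)          # seconds
    bw_weight = weight_bytes / t_compute                      # bytes per second
    e_total = (e_mac_j * macs
               + e_bram_j * bram_accesses                     # reads + writes
               + e_dram_j * dram_accesses)                    # off-chip accesses
    return {
        "t_compute_s": t_compute,
        "bw_weight_Bps": bw_weight,
        "e_total_J": e_total,
        "edp_Js": e_total * t_compute,                        # compute-bound assumption
    }

# All numbers below are illustrative assumptions.
model = ws_cost_model(
    macs=1_024_000, pe_count=256, f_clk_hz=1e9, util=0.5,
    weight_bytes=36_864,
    bram_accesses=2_000_000, dram_accesses=50_000,
    e_mac_j=1e-12, e_bram_j=5e-12, e_dram_j=100e-12,
)
for name, value in model.items():
    print(f"{name}: {value:.4g}")
```

Because the weight term of `dram_accesses` is fixed at one fetch per weight under WS, sweeping `bram_accesses` and the activation share of `dram_accesses` shows how much of the budget is left to activation and partial-sum movement.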

In FPGA case studies, FINN [Umuroglu et al.] achieves >5 TOPS on Zynq-7020 with only LUT resources (no DSPs), while FINN-R [Blott et al.] achieves 12–50 TOPS on VU9P using quantized weights (1–8 bits) (Li, 13 May 2025).

On SIMT architectures, PacQ reports 1.99× speedup and 81% EDP reduction for LLM FFN layers by pinning packed INT4 weight tiles within local scratchpads and streaming FP16 activations (Yin et al., 25 Feb 2025).

5. Advantages, Limitations, and Comparative Dataflows

5.1 Advantages

Minimal weight bandwidth: Ideally, each weight is fetched from off-chip memory once per layer computation, a critical asset when off-chip access is the system bottleneck (Li, 13 May 2025).
Simple control and high pipeline utilization: Regular data movement and statically scheduled PEs facilitate efficient hardware design with minimal stalling.
Suitability for low-precision/binary networks: High weight reuse aligns with the storage and compute characteristics of low-bitwidth FPGAs and binary accelerators (e.g., FINN) (Li, 13 May 2025).
Energy efficiency in emerging 3D ICs: By leveraging dense vertical integration, WS-MONO3D collapses both weight and activation movement, yielding substantial area-normalized energy and throughput gains (Shukla et al., 2024).

5.2 Limitations

Activation/psum traffic bottleneck: Activation reuse is limited. In large networks or when PE array is tall-and-narrow (e.g., many output channels, few input channels), repeated activation loads can dominate memory bandwidth (Li, 13 May 2025, Zhou et al., 2023).
Scaling with network size: On-chip weight storage can become prohibitive for large $C,K$ layers, requiring additional tiling and partial off-chip fetches, attenuating the bandwidth benefit (Li, 13 May 2025).
Mapping rigidity: Effective only when problem size maps cleanly to tiling/blocking strategy. Networks with significant irregularity or sparsity challenge load- and compute-balance (Li, 13 May 2025).
Suboptimal for general CPUs/TPUs at mid/deep layers: On SIMD CPUs, WS is often outperformed by output-stationary or hybrid flows due to poor arithmetic intensity and excessive register pressure (Zhou et al., 2023, Elbtity et al., 2024).
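The scaling limitation can be made concrete with a back-of-the-envelope estimate. The sketch below computes how many output channels fit in a given weight buffer and how often the input activations must then be re-streamed, under the simplifying assumption that activations are fetched from off-chip once per output-channel tile and are not cached between tiles. The function name, buffer size, and layer shape are hypothetical.

```python
import math

def ws_tiling_estimate(C, K, R, S, bytes_per_weight,
                       weight_buffer_bytes, activation_bytes):
    """Estimate K-tiling and resulting off-chip traffic for a WS layer."""
    bytes_per_out_channel = C * R * S * bytes_per_weight
    k_tile = min(K, weight_buffer_bytes // bytes_per_out_channel)
    if k_tile == 0:
        raise ValueError("buffer cannot hold one output channel; tile C as well")
    num_k_tiles = math.ceil(K / k_tile)
    return {
        "k_tile": k_tile,
        "num_k_tiles": num_k_tiles,
        # Weights still cross the off-chip interface once in total.
        "weight_bytes": K * bytes_per_out_channel,
        # Assumption: the full input is re-streamed for every K tile.
        "activation_bytes_streamed": num_k_tiles * activation_bytes,
    }

# Hypothetical layer: C=256, K=512, 3x3 filters, INT8, 58x58 input, 256 KiB buffer.
est = ws_tiling_estimate(C=256, K=512, R=3, S=3, bytes_per_weight=1,
                         weight_buffer_bytes=256 * 1024,
                         activation_bytes=256 * 58 * 58)
print(est)
```

In this example the 1.18 MB of weights needs five tiles, so the 0.86 MB input is streamed five times and activation traffic exceeds weight traffic by more than 3×, which is the bottleneck described in the first limitation.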

Comparative data suggest that output-stationary (OS) or hybrid-switching dataflow architectures (e.g., Flex-TPU) achieve higher performance in mid-to-late network layers or when accumulation locality is higher (Elbtity et al., 2024).

6. Advances, Reconfigurability, and Future Directions

Recent advances have moved beyond static WS, proposing hybrid or runtime-reconfigurable dataflow architectures. Flex-TPU allows layer-wise runtime switching between WS, OS, and IS flows, optimizing for layer shape and maximizing accelerator throughput with only slight area/power overhead. Empirical data on a 32×32 array report that WS is fastest for early ResNet-18 layers; OS then dominates; IS takes over in the narrowest/deepest layers (Elbtity et al., 2024).
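Per-layer switching reduces, at its simplest, to picking the cheapest mode for each layer from an analyzer's estimates. The sketch below shows that selection step only; the layer names and cycle counts are made-up placeholders (not results from Elbtity et al.), and a real flow would obtain them from a cost model or simulator and account for any reconfiguration overhead.

```python
def select_dataflows(layer_costs):
    """Pick the dataflow with the lowest estimated cycle count for each layer."""
    return {layer: min(costs, key=costs.get) for layer, costs in layer_costs.items()}

def total_cycles(layer_costs, schedule):
    return sum(layer_costs[layer][mode] for layer, mode in schedule.items())

# Hypothetical per-layer cycle estimates, in arbitrary units.
estimated_cycles = {
    "early_conv": {"WS": 100, "OS": 140, "IS": 180},
    "mid_conv":   {"WS": 120, "OS": 90,  "IS": 110},
    "late_conv":  {"WS": 160, "OS": 130, "IS": 100},
}

schedule = select_dataflows(estimated_cycles)
fixed_ws = {layer: "WS" for layer in estimated_cycles}
print(schedule)
print(total_cycles(estimated_cycles, schedule), "vs", total_cycles(estimated_cycles, fixed_ws))
```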

In monolithic 3D chips, the WS-MONO3D approach utilizes dense vertical integration for simultaneous one-cycle weight and activation multicast, amplifying the WS benefit, but with significant challenges in thermal management and fabrication complexity (Shukla et al., 2024).

Motivated by the memory-compute trade-off, further research is moving toward:

Adaptive per-layer dataflow switching, guided by analyzer tools.
3D IC integration to raise on-chip bandwidth and area-normalized efficiency.
Hardware-aware DNN architecture search, co-designing model and dataflow for optimal utilization in tight edge or data center envelopes.
Sparsity-aware and mixed-precision WS flows, leveraging locally stationary weights alongside dynamic tactics for activations.

7. Summary Table: WS Dataflow in Major Systems

System/Platform	Dataflow Mode	Notable Features / Results
FINN (Zynq-7020)	WS (binary)	Fully LUT-based, >5 TOPS, minimal DRAM BW
FINN-R (VU9P)	WS (1–8b quant.)	Auto-tiling, 12–50 TOPS, scalable BW/resource
PacQ (Tensor Core)	WS (FP16×INT4)	INT-packed weights pinned, 1.99× speedup, 81% EDP↓
WS-MONO3D (3D sys. arr.)	WS	1-cycle multicasts, up to 40% EDP↓, up to 10× inferences/s/W/mm²
Flex-TPU	Reconfigurable	WS, OS, and IS modes; per-layer runtime switching

WS dataflow remains foundational in accelerator design where the DRAM read cost for weights dominates and where on-chip weight reuse can be maximized. Modern and future designs increasingly hybridize WS with other flows or adaptively select the best strategy per layer, so that efficiency holds up across diverse DNN architectures rather than only on layers that favor weight reuse (Li, 13 May 2025, Shukla et al., 2024, Yin et al., 25 Feb 2025, Zhou et al., 2023, Elbtity et al., 2024).