Triangular Input Movement (TrIM)

Updated 13 February 2026
  • Triangular Input Movement (TrIM) is a dataflow paradigm that exploits a triangular routing pattern to achieve near-optimal local reuse of input activations in a K×K grid.
  • It significantly reduces off-chip memory traffic and register footprint, achieving up to 9× lower DRAM accesses and 16× higher throughput compared to traditional dataflows.
  • The 3D-TrIM variant enhances efficiency with shadow registers and shared buffers, leading to improved area and energy metrics in deep learning accelerator designs.

Triangular Input Movement (TrIM) is a dataflow and architectural paradigm for spatial computing, proposed to address the Von Neumann bottleneck and data redundancy in convolutional workloads, notably in Systolic Arrays (SAs) for deep learning accelerators. TrIM achieves near-optimal local reuse of input feature map activations by exploiting a triangular routing pattern within a K×K grid of processing elements (PEs), yielding substantial reductions in off-chip memory accesses, register footprint, and on-chip memory energy compared to traditional dataflows. The 3D-TrIM variant introduces shadow registers and buffer sharing for further efficiency in both area and memory traffic. TrIM's core principles, quantitative analysis, and practical hardware impact have been developed and validated in a series of works (Sestito et al., 2024, Sestito et al., 2024, Sestito et al., 26 Feb 2025).

1. Motivation: Bottleneck Analysis and Rationale

The major design impetus behind TrIM is the high cost of off-chip memory access, typically $10^2$–$10^3\times$ greater than a multiply-accumulate (MAC) in terms of energy, which dominates the system power budget for AI and CNN accelerators. Standard 2D convolutions process an input feature map (ifmap) of dimensions $H_I \times W_I$ with $C_I$ channels, $F_O$ filters, and $K \times K$ kernels via a sliding-window pattern. Naïve implementations load input activations redundantly, as each $K \times K$ window overlaps adjacent windows on $K-1$ rows and columns. GEMM-based mappings exacerbate this redundancy due to "im2col" packing, while other dataflows such as weight-stationary (WS) and row-stationary (RS) partially mitigate but do not eliminate redundant traffic or buffer overheads (Sestito et al., 2024).
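To make the cost asymmetry concrete, a back-of-envelope energy model can be sketched. The $200\times$ DRAM-to-MAC cost ratio below is an assumed illustrative value within the cited $10^2$–$10^3\times$ range, not a figure from the papers; only the ratio structure matters.

```python
# Illustrative energy model: compute vs. off-chip memory cost for one
# stride-1 KxK convolution layer. E_DRAM / E_MAC = 200 is an assumption.
E_MAC = 1.0          # energy of one multiply-accumulate (arbitrary units)
E_DRAM = 200.0       # assumed energy of one off-chip activation access

def conv_energy(h_i, w_i, k, reuse):
    """Total energy for the layer; `reuse` = MACs served per off-chip
    activation load (1 = fully naive, ~K^2 = TrIM-like reuse)."""
    h_o, w_o = h_i - k + 1, w_i - k + 1   # valid-convolution output size
    macs = k * k * h_o * w_o              # total MAC operations
    loads = macs / reuse                  # off-chip activation fetches
    return macs * E_MAC + loads * E_DRAM

naive = conv_energy(224, 224, 3, reuse=1)
trim_like = conv_energy(224, 224, 3, reuse=9)
print(naive / trim_like)   # ~8.7: memory cost dominates, so K^2 reuse
                           # cuts total layer energy almost by K^2
```

Because the DRAM term dwarfs the compute term, the energy saving tracks the reuse factor almost linearly, which is why input reuse is the central lever TrIM targets.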

2. TrIM Dataflow Structure and Operation

TrIM organizes processing in a K×K two-dimensional systolic slice, where each PE holds a stationary weight and receives a dynamic flow of input activations and partial sums. The input delivery pattern—termed triangular input movement—is defined as follows:

  • During each cycle, activations enter at the top-right corner, propagate left across each row (horizontal reuse), and, upon reaching the row end, shift diagonally to the leftmost PE of the next lower row (diagonal reuse via a shift-register buffer).
  • This pattern ensures each activation is consumed by all $K$ PEs in its row and then by each row below, with a total potential of $K$ uses per activation.
  • Partial sums are accumulated by downward flow through PE columns and finalized via a small adder tree beneath each slice (Sestito et al., 2024, Sestito et al., 2024).

A PE's role is summarized by:

  • Storing one weight,
  • Accumulating partial sum from above,
  • Registering the current input,
  • Routing the input leftward or into the shift-buffer, as dictated by the triangular path.

The result is a right-triangle of computation, representing full utilization of each loaded activation across a $K \times K$ MAC domain.
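As a sanity check on this utilization claim, a minimal functional model (not cycle-accurate, and abstracting away the triangular routing itself) can count how many MACs each loaded activation ultimately feeds in a weight-stationary $K \times K$ slice:

```python
import numpy as np

def slice_conv_with_reuse(ifmap: np.ndarray, weights: np.ndarray):
    """Functional model of one weight-stationary KxK slice: slide the
    kernel over the ifmap, and tally how many MACs each activation
    serves. Routing/timing details of TrIM are deliberately abstracted."""
    K = weights.shape[0]
    H, W = ifmap.shape
    Ho, Wo = H - K + 1, W - K + 1
    ofmap = np.zeros((Ho, Wo))
    uses = np.zeros_like(ifmap)          # MACs served per activation
    for r in range(Ho):
        for c in range(Wo):
            ofmap[r, c] = np.sum(ifmap[r:r+K, c:c+K] * weights)
            uses[r:r+K, c:c+K] += 1      # each window pixel used once
    return ofmap, uses

ifmap = np.arange(36.0).reshape(6, 6)
_, uses = slice_conv_with_reuse(ifmap, np.ones((3, 3)))
print(uses[3, 3])   # interior pixel: feeds 9 = K^2 MACs
```

Interior activations feed exactly $K^2$ MACs, while border activations feed fewer, which is the boundary overhead the utilization formula in the next section accounts for.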

3. Analytical Modeling and Quantitative Comparison

The TrIM dataflow is analytically characterized by three key metrics:

a) Local Input Utilization:

Each activation from off-chip memory serves approximately $K^2$ MACs, and for large maps the input-reuse factor approaches $K^2$. The utilization is calculated as:

$$U_{in} = \frac{K^2 H_O W_O}{H_I W_I + OV}$$

where $OV$ is the marginal boundary effect for incomplete triangles (Sestito et al., 2024).
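The formula can be evaluated directly. In this small sketch the boundary term $OV$ is treated as a free parameter, since its exact form is layer-dependent and not reproduced here:

```python
# Evaluate U_in = K^2 * H_O * W_O / (H_I * W_I + OV) for a stride-1,
# valid-padding layer. OV (incomplete-triangle overhead) is left as an
# adjustable input rather than derived.
def u_in(h_i: int, w_i: int, k: int, ov: int = 0) -> float:
    h_o, w_o = h_i - k + 1, w_i - k + 1
    return (k * k * h_o * w_o) / (h_i * w_i + ov)

print(u_in(224, 224, 3))   # ~8.84: approaches K^2 = 9 for large maps
```

For shrinking spatial dimensions the ratio falls away from $K^2$, which is the regime where the 3D-TrIM enhancements of Section 4 matter most.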

b) Memory Traffic:

TrIM achieves significant reduction compared to WS and RS:

  • WS: $A_{WS}^{in} = K^2 H_I W_I M$
  • TrIM: $A_{TrIM}^{in} = H_I W_I M$

For $K=3$, $A_{WS}^{in}/A_{TrIM}^{in} \approx 9\times$; the ratio rises further for larger $K$. TrIM also eliminates the need for large per-PE SRAM scratch-pads inherent in RS, resulting in an $11$–$16\times$ reduction in hidden memory accesses (Sestito et al., 2024, Sestito et al., 2024).
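A quick check of the ratio implied by these formulas (with the tiling factor $M$ set to 1 purely for illustration):

```python
# Input-traffic formulas from above: WS re-fetches each activation K^2
# times, TrIM fetches it once. M (number of ifmap passes) is taken as 1.
def a_ws(h_i, w_i, k, m=1):
    return k * k * h_i * w_i * m

def a_trim(h_i, w_i, m=1):
    return h_i * w_i * m

for k in (3, 5, 7):
    print(k, a_ws(224, 224, k) / a_trim(224, 224))   # ratios: 9.0, 25.0, 49.0
```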

c) Throughput:

  • TrIM achieves ideal per-PE throughput at $2$ OPs/cycle/PE, since multiplies and accumulates are perfectly overlapped.
  • RS throughput is capped at $2K/(2K-1)$ OPs/cycle/PE (e.g. $1.5$ for $K=3$), rendering TrIM up to $81.8\%$ faster at high $K$.

A summary comparison for key metrics is shown:

| Dataflow | DRAM Traffic (inputs) | On-chip Registers | Peak Throughput |
|----------|-----------------------|-------------------|-----------------|
| WS | $K^2 H_I W_I M$ | Moderate | Drops for small feature maps |
| RS | $H_I W_I M + O(K)$ | Large ($2K+1$ per PE) | Below TrIM |
| TrIM | $H_I W_I M$ | Minimal (SRBs only) | $2$ OPs/cycle/PE |

(Sestito et al., 2024, Sestito et al., 2024)

4. 3D-TrIM: Architectural Enhancements

3D-TrIM introduces two core improvements (Sestito et al., 26 Feb 2025):

  • Shadow Registers: Each Input Recycling Buffer (IRB) is supplemented with $(K-1)^2$ shadow registers to eliminate TrIM’s end-of-row extra memory traffic. The last $K-1$ pixels of each row, previously forcing memory rereads, are now captured and replayed locally.
  • Buffer Sharing and Array Organization: Instead of dedicating an IRB to each $K \times K$ slice (as in TrIM), 3D-TrIM shares a single IRB across $P_O$ slices. These slices process the same ifmap across parallel output filters, with spatial accumulation handled by a shared adder hierarchy.

These changes yield an ideal activation load count:

$$N_{\text{loads,3D}} = \frac{H_I W_I C_I}{K}$$

eliminating the reload penalty and improving area efficiency by consolidating buffer resources. Empirically, 3D-TrIM achieves up to $3.37\times$ more operations per memory access on CNNs with shrinking spatial dimensions (e.g., deep layers of VGG-16 and AlexNet), and reaches area and energy efficiencies of $4.47$ TOPS/mm$^2$ and $4.54$ TOPS/W in 22 nm silicon (Sestito et al., 26 Feb 2025).
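A sketch evaluating this load count. The reading of the $1/K$ factor as a consequence of IRB sharing across slices is an interpretation here, not a derivation from the paper:

```python
# Ideal activation load count for 3D-TrIM, per the formula above:
# N_loads = H_I * W_I * C_I / K. The 1/K factor is attributed to the
# shared Input Recycling Buffer (interpretation hedged; see the cited
# paper for the exact derivation).
def n_loads_3d(h_i: int, w_i: int, c_i: int, k: int) -> float:
    return h_i * w_i * c_i / k

# Plain per-pixel loading would fetch H_I*W_I*C_I = 3,211,264 activations
# for this layer; 3D-TrIM cuts that by K:
print(n_loads_3d(224, 224, 64, 3))
```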

5. Hardware Realizations and Empirical Results

FPGA and ASIC implementations corroborate the model:

  • An FPGA-based TrIM engine on a Zynq UltraScale+ (XCZU7EV) comprises 1,512 LUT-only PEs, delivers $453.6$ GOPS at $150$ MHz, and realizes a $5.1\times$ reduction in memory traffic and $16\times$ higher throughput over Eyeriss (RS-based) for VGG-16 (Sestito et al., 2024).
  • TrIM's on-chip register count is up to $15.6\times$ lower than RS for $K=7$, $I=256$ (Sestito et al., 2024).
  • 3D-TrIM in 22 nm CMOS achieves $1.15$ TOPS at $0.26$ mm$^2$ and $0.25$ W ($4.47$ TOPS/mm$^2$, $4.54$ TOPS/W) (Sestito et al., 26 Feb 2025).

These results generalize across parallelization parameters (number of slices/cores/filters), and TrIM's advantages remain consistent across diverse CNN topologies and layer widths.

6. Limitations, Trade-offs, and Future Directions

TrIM’s benefits must be assessed in the context of certain trade-offs:

  • Kernel Size Flexibility: Arrays must be sized for, or reconfigurable to, the largest $K$; scaling to variable or large $K$ increases buffer and routing overhead (Sestito et al., 2024).
  • SRB Depth: Grows with the input feature map width $W_I$; for extremely large maps, SRB (or IRB) sizing may become a constraint.
  • Idle Resource Utilization: When the number of input channels is small relative to the available slices, some PEs may remain under-utilized.
  • Sparsity: In the presence of high activation/weight sparsity, TrIM (in its original form) does not skip zeros intra-slice; dataflows specialized for sparse matrices may prevail.
  • Routing Complexity: The diagonal activation path requires some additional interconnect, but it remains localized along the array edges (Sestito et al., 2024).

Identified directions include dynamic array reconfiguration for mixed kernel sizes, tiling strategies for extremely large feature maps, power/energy modeling of buffer/route structures, and ASIC-optimized implementations leveraging the buffer-efficient hierarchy of TrIM.

7. Significance and Broader Context

TrIM is distinguished from prior systolic array dataflows by its K-fold input reuse and avoidance of large per-PE SRAM scratch-pads. It enables substantial reductions—order-of-magnitude or better—in DRAM traffic, on-chip register count, and energy, while achieving or exceeding the throughput of conventional arrays. The 3D-TrIM evolution demonstrates the extensibility of the paradigm to further dimensions of parallelism and buffer sharing, providing a scalable template for high-density, energy-optimized AI accelerator design. TrIM's analytic modeling, open hardware implementation guidance, and comparative benchmarks set a clear standard for future work in spatial convolutional architectures (Sestito et al., 2024, Sestito et al., 2024, Sestito et al., 26 Feb 2025).
