Temporal Tiling Strategies
- Temporal tiling is a loop transformation that reorganizes iterations to compute multiple time steps per spatial tile, enhancing data reuse.
- It applies skewing and polyhedral methods to align data dependencies, optimizing register usage and cache performance across diverse architectures.
- Empirical studies demonstrate 20-27.5% speed-ups and up to 3× reduction in memory traffic in applications ranging from stencil PDE solvers to FPGA accelerators.
Temporal tiling strategy refers to a class of loop and data management transformations that exploit data locality across the time or scheduling dimension in iterative computations, notably for stencil-based PDE solvers, register optimization in loops, and high-throughput accelerator kernels. While spatial tiling (blocking) partitions data across space or memory footprint, temporal tiling organizes work such that multiple time steps or iteration groups are computed per spatial tile, maximizing reuse of cached or buffered data and thus reducing memory traffic or register pressure. This article presents a rigorous overview of temporal tiling strategies, drawing on recent advances in PDE solvers (Sim, 2018, McCormick, 2017), FPGA accelerators (Li, 13 May 2025), multicore stencil kernels (Malas et al., 2014), and compiler register management (Domagala et al., 2014).
1. Formal Definitions and Mathematical Foundations
Temporal tiling transforms the iteration space of a loop nest to block simultaneously across the temporal axis and the spatial (or "computation") axes. The iteration domain is generally parameterized as $\{(t, \mathbf{x}) : 0 \le t < T,\ \mathbf{x} \in [0, N)^d\}$ for a time dimension $t$ and a $d$-dimensional spatial domain.
Core principle: For stencil updates $u^{t+1}(\mathbf{x}) = f\big(u^{t}(\mathbf{x}+\mathbf{d}) : \|\mathbf{d}\|_\infty \le r\big)$, each element's update at time $t+1$ depends on a spatial neighborhood from time $t$; thus, computing several $t$'s within each spatial tile enables the reuse of spatial points before eviction from cache or buffer.
Skewing: To enable legal temporal tiling in the presence of data dependencies crossing tile boundaries, skewing is applied—spatial indices are redefined as, e.g., $x' = x + \sigma t$, with the skew factor $\sigma$ set to the maximum spatial stencil radius. This transformation aligns inter-iteration dependencies lexicographically within the tiled space, permitting legal loop interchange and tile ordering (Sim, 2018, McCormick, 2017).
Polyhedral formulation: Using the polyhedral model, one constructs affine transformations and strip-mines both temporal and spatial axes, generating tiling schedules that satisfy data dependencies and optimize data locality by maximizing the number of reused data elements per tile load (McCormick, 2017).
2. Parameterization and Implementation Techniques
Temporal tiling strategies introduce tunable parameters:
| Parameter | Role in Temporal Tiling | Notes |
|---|---|---|
| Time-tile depth $T_t$ | Number of timesteps computed per tile | Larger $T_t$ increases data reuse but also working-set size |
| Spatial tile sizes $T_x, T_y, T_z$ | Block length along each skewed spatial axis | Balances cache/buffer capacity |
| Skew factor $\sigma$ | Shifts spatial indices so dependencies are satisfied | Typically set to the maximum stencil radius |
In Devito, all tile sizes and skew factors are exposed as runtime parameters, and an auto-tuner searches over reasonable discrete values (e.g., powers of two) to maximize throughput (Sim, 2018). The skewing and tile bounds are constructed such that data dependencies, especially those with negative distances, are transformed into lexicographically forward directions.
Accelerator implementations (such as on FPGAs) treat the temporal tile as a group of iterations mapped onto a fixed set of PEs and buffer levels. For example, the temporal tile size may denote the number of output channels processed in sequence, with separate on-chip buffers sized for activations, weights, and partial outputs (Li, 13 May 2025).
3. Arithmetic Intensity and Performance Modeling
Temporal tiling sharply increases the arithmetic intensity—the number of flops per byte of memory traffic—by amortizing memory loads over multiple time steps or iteration groups.
In 3D stencil computations, for spatial tile volume $V = T_x T_y T_z$ and time-tile depth $T_t$, an arithmetic intensity estimator is $\mathrm{AI} \approx F\,V\,T_t \,/\, \big(B\,(V + r\,T_t\,A)\big)$, where $F$ is flops per grid point per timestep, $B$ is bytes accessed per grid point, $r$ is the stencil radius, and $A$ is the total area of tile faces that must be reloaded each timestep (Sim, 2018).
In the FPGA context, buffer reuse is analyzed via reuse factors: each weight word is reused across many output activations before a new weight block is fetched (Li, 13 May 2025).
The roofline model then predicts attained performance $P = \min(P_{\text{peak}},\ \mathrm{AI} \times BW)$, with attainable speed-up tied to whether the arithmetic intensity is sufficient to saturate compute throughput or the kernel remains memory-bound (Sim, 2018, McCormick, 2017).
4. Loop-Nest Transformation and Code Generation
The canonical transformation of a four-dimensional loop (time and 3D space) into a temporally tiled, skewed nest is:
```c
for (int b_t = 0; b_t < T; b_t += T_t)
  for (int b_x = 0; b_x < N_x + σ * b_t; b_x += T_x)
    for (int b_y = 0; b_y < N_y + σ * b_t; b_y += T_y)
      for (int b_z = 0; b_z < N_z + σ * b_t; b_z += T_z)
        for (int t = b_t; t < min(b_t + T_t, T); ++t)
          for (int x_ = b_x; x_ < min(b_x + T_x, N_x + σ * t); ++x_)
            for (int y_ = b_y; y_ < min(b_y + T_y, N_y + σ * t); ++y_)
              for (int z_ = b_z; z_ < min(b_z + T_z, N_z + σ * t); ++z_) {
                int x = x_ - σ * t;  /* un-skew to recover physical indices */
                int y = y_ - σ * t;
                int z = z_ - σ * t;
                if (x < 0 || x >= N_x || y < 0 || y >= N_y
                    || z < 0 || z >= N_z)
                  continue;          /* skip points outside the domain */
                compute_stencil(u, t, x, y, z);
              }
```
In register optimization (Domagala et al., 2014), a 2D tiling covers the instruction (statement) versus iteration grid, with rectangular tiles spanning a fixed number of statements (height) and iterations (width), scheduled row-by-row within tiles and with "just-in-time" loading/storing of data dependences at tile boundaries. A constraint-programming formulation explores legal tile shapes and intra-tile schedules, and minimizes total memory spill/sync cost.
5. Extensions: Imperfect Nests, Multicore, and Heterogeneous Hardware
Temporal tiling extends beyond perfect hyperrectangular loop nests. For imperfect nests featuring conditionals, halo region updates, or irregular access patterns:
- The core nested region carrying true data dependences is identified and temporally tiled; conditional or boundary code is left outside the tiled region or tiled with unit time depth (Sim, 2018).
- Stencil kernels on multicore CPUs benefit from diamond tiling in time-space (“wavefront diamond blocking”), where diamond-shaped tiles respect causality, and “thread groups” cooperatively update tiles, trading off concurrency vs. cache sharing (Malas et al., 2014).
- In FPGAs, temporal tiling is tightly co-designed with memory hierarchies (HBM, BRAM, LUTRAM) and dataflow arrangements (weight-stationary, output-stationary, etc.), often employing double-buffering and deep pipelines to hide latency and amortize memory traffic (Li, 13 May 2025).
6. Empirical Results, Trade-Offs, and Best Practices
Temporal tiling consistently yields substantial reductions in runtime and external memory traffic:
- In Devito, a 3D 7-point stencil with time-tiling achieves speed-ups of 20--27.5% versus pure spatial tiling, with the larger gains reached by 13-point stencils (Sim, 2018, McCormick, 2017).
- On multicore CPUs, wavefront-diamond tiling delivers up to 3× lower memory traffic and commensurate speed-ups in memory-bound stencils, with energy-to-solution reductions of around 50% (Malas et al., 2014).
- For edge-AI FPGA accelerators, temporal tiling over output channels in FINN-style BNNs reduces external weight fetches and increases throughput, at the cost of slightly higher latency due to pipeline depth (Li, 13 May 2025).
- In register optimization, temporal tiling within the statement–iteration grid substantially reduces average memory loads and constrains register pressure below hardware limits in most cases that previously overflowed (Domagala et al., 2014).
Key trade-offs involve choosing the time-tile depth $T_t$ large enough to maximize reuse but small enough not to exceed buffer or register resources, maintaining pipeline depths compatible with latency targets, and aligning dataflow schedules to minimize idle compute units.
Best practices include:
- Auto-tuning tile sizes for cache/buffer fit (Sim, 2018, McCormick, 2017).
- Modeling resource footprints ahead of HLS synthesis in accelerator flows (Li, 13 May 2025).
- Using double buffering to overlap data movement in deeply pipelined architectures.
- Integrating tiling as a tunable parameter in cost models for compilers and hardware design flows (Li, 13 May 2025).
7. Outlook and Areas for Further Research
Advances in temporal tiling strategies continue to target:
- Fully automated integration with polyhedral compilers (e.g., direct isl/CLooG binding in Devito DLE) (McCormick, 2017).
- Extensions to heterogeneous loops, irregular computations, and hybrid dataflow scenarios, especially in the design of next-generation edge-AI accelerators (Li, 13 May 2025).
- Formal optimization for balancing concurrency, register usage, spill cost, and cache utilization via constraint programming or analytical modeling (Domagala et al., 2014, Malas et al., 2014).
- Unified frameworks to simultaneously exploit time and space locality across diverse architectures (CPU, GPU, FPGA), including hybrid tiling (temporal+spatial) and support for domain-specific languages.
A plausible implication is that as memory bandwidth remains a principal bottleneck for both scientific and AI workloads, sophisticated temporal tiling strategies and their automated derivation will be foundational elements of high-performance software and hardware toolchains.