Tiled Memory Access

Updated 24 December 2025
  • Tiled memory access is a strategy that divides data and computations into small blocks to optimize cache reuse, reduce bandwidth, and enhance parallelism.
  • It underpins modern architectures such as computing-in-memory accelerators, CPU/GPU caching, and sparse data processing, achieving significant speedups and resource efficiency.
  • Advanced techniques like cache-aware tiling, warp-overlapped tiling, and fused tiling address memory constraints and synchronization challenges to maximize performance.

Tiled memory access refers to a class of memory access strategies that partition data and computations into small, regularly or irregularly shaped blocks—“tiles”—to exploit spatial and temporal locality, minimize bandwidth usage, and optimize parallelism across modern architectures. This paradigm underlies the design of efficient computing-in-memory accelerators, CPU/GPU caching strategies, sparse data access schemes, and AI system software, adapting to the specific constraints and opportunities of each target platform.

1. Fundamentals of Tiled Memory Access

Tiled memory access schemes aim to improve memory hierarchy utilization by dividing computational domains (e.g., matrices, tensors, grids) into sub-blocks or tiles. This arrangement allows data required for a group of computations to reside jointly in a fast-access memory (cache, scratchpad, local buffer), maximizing reuse and reducing expensive access to slower memory layers.

Mathematically, a $d$-dimensional computation is split into tiles characterized by their extents $(T_1, \ldots, T_d)$; each tile contains

$$F = \prod_{k=1}^d T_k$$

data elements. The cumulative working set for each tile is designed not to exceed the chosen memory level's capacity (e.g., the L1 capacity $W_{L1}$):

$$A \cdot s \cdot \prod_{k=1}^d T_k \leq W_{L1}$$

where $A$ is the number of independent arrays accessed and $s$ is the element size (Cashman, 26 Sep 2025).

Tiles may be static (decided at compile time, often for structured domains) or dynamic (using runtime analysis for unstructured domains). Access to each tile is optimized for the platform’s memory hierarchy and parallel structure, whether explicitly via autotuning, heuristics, cache-oblivious recursion, or measured latency boundaries (Cashman, 26 Sep 2025, Ranasinghe et al., 2018).
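
To make the footprint constraint concrete, the sketch below applies it to classic loop tiling for matrix multiplication. The cache size, element type, and tile width are illustrative assumptions (a 32 KiB L1, 8-byte doubles, three live arrays), giving $3 \cdot 8 \cdot T^2 \leq 32768$, i.e. $T \leq 36$.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Minimal sketch of L1-sized loop tiling for C += A * B (row-major, N x N).
// Assumed budget: A = 3 live arrays, s = 8-byte doubles, W_L1 = 32 KiB,
// so 3 * 8 * T^2 <= 32768 gives T <= 36; T = 32 is chosen for alignment.
constexpr std::size_t T = 32;

void tiled_matmul(const std::vector<double>& A, const std::vector<double>& B,
                  std::vector<double>& C, std::size_t N) {
    for (std::size_t ii = 0; ii < N; ii += T)
        for (std::size_t kk = 0; kk < N; kk += T)
            for (std::size_t jj = 0; jj < N; jj += T)
                // One T x T tile triple: its joint footprint fits in L1, so
                // each element fetched from DRAM is reused ~T times in cache.
                for (std::size_t i = ii; i < std::min(ii + T, N); ++i)
                    for (std::size_t k = kk; k < std::min(kk + T, N); ++k) {
                        const double a = A[i * N + k];
                        for (std::size_t j = jj; j < std::min(jj + T, N); ++j)
                            C[i * N + j] += a * B[k * N + j];
                    }
}
```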

2. Tiled Access in Computing-in-Memory (CIM) and Specialized Accelerators

In RRAM-based computing-in-memory (CIM) accelerators, such as those described in CLSA-CIM, each tile typically maps to a physical processing element (PE) comprising an RRAM crossbar, I/O buffer, and supporting logic. The architecture statically partitions weights and activations into $M \times N$ tiles (submatrices), assigning each to a unique PE. The mapping is weight-stationary (no runtime remapping), enabling activations to be streamed efficiently to each PE's buffer (Pelke et al., 15 Jan 2024).

Formally, for convolution-to-GEMM conversion, the weight matrix

$$W \in \mathbb{R}^{(K_W K_H C_I) \times C_O}$$

is block-partitioned into $M \times N$ tiles, and the PE requirement per layer $i$ is

$$c_i = P_{V,i} \times P_{H,i}$$

with $P_{V,i} = \left\lceil \frac{K_W K_H C_I}{N} \right\rceil$ and $P_{H,i} = \left\lceil \frac{C_O}{M} \right\rceil$.
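
As a small illustration of this sizing rule, the sketch below computes $c_i$ for one layer; the struct and function names are invented for exposition.

```cpp
#include <cstdint>

// Sketch: PEs required to map one conv layer onto M x N RRAM crossbar tiles,
// following c_i = ceil(Kw*Kh*Ci / N) * ceil(Co / M). Names are illustrative.
struct ConvLayer { std::int64_t Kw, Kh, Ci, Co; };

std::int64_t pes_required(const ConvLayer& l, std::int64_t M, std::int64_t N) {
    const std::int64_t rows = l.Kw * l.Kh * l.Ci;   // GEMM weight-matrix rows
    const std::int64_t P_V  = (rows + N - 1) / N;   // vertical tile count
    const std::int64_t P_H  = (l.Co + M - 1) / M;   // horizontal tile count
    return P_V * P_H;                               // c_i = P_V * P_H
}
// Example: a 3x3, 64->128 channel layer on 256x256 crossbars needs
// ceil(576/256) * ceil(128/256) = 3 * 1 = 3 PEs.
```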

Scheduling combines intra-layer ordering (ensuring resource exclusivity) with cross-layer global scheduling, maximizing parallel PE utilization under data dependency constraints. Proposed cross-layer algorithms (e.g., CLSA-CIM) have demonstrated order-of-magnitude speedup (up to $29.2\times$) and PE utilization improvements (up to $17.9\times$) relative to sequential, layer-wise baselines (Pelke et al., 15 Jan 2024).

3. Advanced Tiling Schemes for Multicore CPUs and GPUs

In CPU environments, temporal blocking and tiling can dramatically reduce main-memory bandwidth requirements for data-intensive patterns such as stencils. Multicore wavefront diamond blocking (MWD) combines space-time diamond tiling with wavefront parallelism and tunable thread groups. The diamond’s spatial and temporal extents are chosen to maximize in-cache temporal reuse:

  • Each diamond tile covers $A_\text{tile} = \frac{1}{2} D_w^2 - R D_w + D_w (N_F + 1)$ points (for diamond width $D_w$, stencil radius $R$, and $N_F$ extra frontlines).
  • The effective bytes/LUP ratio is reduced from the naive $B_\text{naive}$ to

$$B_\text{MWD} \simeq \frac{B_\text{naive}}{\tau}$$

where $\tau$ is the number of time levels per tile (Malas et al., 2014, Malas et al., 2015).
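
These two model quantities are simple enough to evaluate directly; a minimal sketch follows (parameter names invented for exposition).

```cpp
// Sketch of the two MWD model quantities above.
// Example: Dw = 20, R = 1, Nf = 1 gives A_tile = 200 - 20 + 40 = 220 points.
double diamond_tile_points(double Dw, double R, double Nf) {
    return 0.5 * Dw * Dw - R * Dw + Dw * (Nf + 1.0);
}

// Temporal reuse over tau time levels divides the naive bytes/LUP traffic.
double mwd_bytes_per_lup(double B_naive, double tau) {
    return B_naive / tau;
}
```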

Register- and shared-memory-aware tiling is critical for GPU kernels. Model-based warp-overlapped tiling (OTPW) designs overlapped tiles mapped to warps (typically $32$ threads), using a hybrid region allocation: critical data resides in per-thread registers (exploiting warp shuffle), while larger regions are held in shared memory (Jangda et al., 2019). Trade-offs between register pressure, shared-memory utilization, and occupancy are analytically modeled and dynamically scheduled. This reduces synchronization overhead and improves global-memory coalescing, yielding up to $2.25\times$ speedup over expert-tuned non-tiling schedules.
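
The occupancy trade-off that such schedulers model can be approximated with a back-of-the-envelope resource calculation. In the sketch below, the per-SM limits (64K registers, 96 KiB shared memory, 48 warps) are assumptions for a typical recent NVIDIA SM, not values from the paper.

```cpp
#include <algorithm>
#include <cstdint>

// Sketch: estimate resident warps per SM from per-thread register use and
// per-block shared-memory use. Limits are illustrative assumptions.
struct SmLimits {
    std::uint32_t regs = 65536, smem_bytes = 96 * 1024, max_warps = 48;
};

std::uint32_t resident_warps(std::uint32_t regs_per_thread,
                             std::uint32_t smem_per_block,
                             std::uint32_t warps_per_block,
                             SmLimits sm = {}) {
    const std::uint32_t by_regs =
        sm.regs / (regs_per_thread * 32);      // warps admitted by registers
    const std::uint32_t by_smem =              // blocks by shared mem, then warps
        (sm.smem_bytes / std::max(smem_per_block, 1u)) * warps_per_block;
    return std::min({by_regs, by_smem, sm.max_warps});
}
// Moving tile data from shared memory into registers raises regs_per_thread
// and can shrink by_regs; overlapped-tiling schedulers weigh exactly this.
```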

4. Tiling for Sparse, Irregular, and Unstructured Domains

Structured tiling is challenging for sparse or unstructured meshes where indirect or irregular memory accesses dominate. Sparse tiling, as formalized in the SLOPE compiler, partitions iteration spaces subject to dynamic dependence analysis. Via inspector-executor strategies, iteration sets are fused into tiles, coloring and reordering iterations to maximize reuse of shared data in cache while ensuring dependency correctness (Luporini et al., 2017). Similar techniques generalize to compressed sparse tensor tiling (e.g., GrateTile for CNNs), which partitions sparse tensors into uneven, “halo-aligned” subtensors to prevent redundant transfer, maintaining average DRAM bandwidth reductions of 54% at less than 1% metadata overhead (Lin et al., 2020).

The executor loops over colored tiles, executing all relevant operations per tile, so that indirectly accessed data (e.g., A[map[i]]) are loaded once and retained until all accesses complete. Indirect data reuse is thereby maximized, achieving measurable speedup on large-core clusters (up to $1.28\times$) (Luporini et al., 2017).
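
A minimal inspector-executor skeleton for this pattern is sketched below; the inspector here uses a plain block partition for brevity, whereas SLOPE derives tiles from runtime dependence analysis and coloring.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Sketch of the inspector-executor pattern for indirect accesses y[i] += A[map[i]].
struct Tile { std::size_t begin, end; };

// Inspector: group iterations into tiles. This simplified version blocks the
// iteration space; SLOPE instead colors and reorders tiles at runtime.
std::vector<Tile> inspect(std::size_t n_iters, std::size_t tile_size) {
    std::vector<Tile> tiles;
    for (std::size_t b = 0; b < n_iters; b += tile_size)
        tiles.push_back({b, std::min(b + tile_size, n_iters)});
    return tiles;
}

// Executor: finish all work of one tile before moving on, so the indirectly
// accessed A[map[i]] entries stay cache-resident for the tile's lifetime.
void execute(const std::vector<Tile>& tiles, const std::vector<std::size_t>& map,
             const std::vector<double>& A, std::vector<double>& y) {
    for (const Tile& t : tiles)
        for (std::size_t i = t.begin; i < t.end; ++i)
            y[i] += A[map[i]];
}
```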

5. Cache-Aware and Cache-Oblivious Tiling: Modeling and Automation

Selecting optimal tile sizes depends on the underlying cache hierarchy and empirical access patterns. Latency-Based Tiling (LBT) uses machine-timed microbenchmarks with triangular access probes to empirically measure access latency $t(W)$ versus working set size $W$, identifying cache boundaries by peaks in the discrete derivative $d_i = \frac{t_{i+1} - t_i}{W_{i+1} - W_i}$ (Cashman, 26 Sep 2025). The resulting estimates,

$$W_{L1} < W_{L2} < W_{L3}$$

are used to bound per-tile footprints. Tiles are then sized via the constraint $A \cdot s \cdot \prod_k T_k \leq W_\text{cache}$ for each cache level.
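
The measurement underlying LBT can be approximated with a pointer-chasing probe such as the sketch below. It is a simplification (LBT's triangular access probes and boundary detection are more elaborate); the single-cycle permutation defeats hardware prefetching so that $t(W)$ reflects the cache level holding the working set.

```cpp
#include <chrono>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Sketch: average access latency t(W) for a working set of w_bytes, measured
// by chasing a single-cycle random permutation. Sweeping w_bytes and taking
// the discrete derivative d_i = (t_{i+1}-t_i)/(W_{i+1}-W_i) exposes L1/L2/L3.
double probe_latency_ns(std::size_t w_bytes, std::size_t n_accesses = 1 << 24) {
    const std::size_t n = std::max<std::size_t>(w_bytes / sizeof(std::size_t), 2);
    std::vector<std::size_t> next(n);
    std::iota(next.begin(), next.end(), 0);
    std::mt19937_64 rng{42};
    // Sattolo's algorithm: a permutation with exactly one cycle, so the chase
    // visits every slot and cannot settle into a small resident subset.
    for (std::size_t i = n - 1; i > 0; --i) {
        std::uniform_int_distribution<std::size_t> d(0, i - 1);
        std::swap(next[i], next[d(rng)]);
    }
    auto t0 = std::chrono::steady_clock::now();
    std::size_t p = 0;
    for (std::size_t i = 0; i < n_accesses; ++i) p = next[p];  // dependent loads
    auto t1 = std::chrono::steady_clock::now();
    volatile std::size_t sink = p;  // keep the chase from being optimized away
    (void)sink;
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / n_accesses;
}
```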

Cache-oblivious tiling, in contrast, recursively partitions the computation via divide-and-conquer methods (e.g., PCOT (Ranasinghe et al., 2018)). This technique automatically exposes locality at all hierarchy levels by ensuring that the recursion's base cases match cache sizes at some recursion depth, yielding reduced off-chip traffic, particularly for low-dimensional or bandwidth-limited kernels. Reported results show $60$–$80\%$ reductions in LLC misses compared to tuned single-level tiling, though with modest impact on peak compute-bound throughput due to recursion overhead and hardware prefetching.
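
The flavor of cache-oblivious tiling is easiest to see in a textbook recursive transpose (a generic sketch, not PCOT itself): the recursion halves the larger dimension until the subproblem fits in whichever cache level is present, without ever naming a tile size.

```cpp
#include <cstddef>
#include <vector>

// Cache-oblivious transpose of the region [r0,r1) x [c0,c1) of an n x n
// row-major matrix A into B = A^T. No explicit tile size: at *some* recursion
// depth the subproblem fits each cache level, giving locality at every level.
void co_transpose(const std::vector<double>& A, std::vector<double>& B,
                  std::size_t r0, std::size_t r1,
                  std::size_t c0, std::size_t c1, std::size_t n) {
    const std::size_t dr = r1 - r0, dc = c1 - c0;
    if (dr * dc <= 64) {                        // base case: tiny block
        for (std::size_t r = r0; r < r1; ++r)
            for (std::size_t c = c0; c < c1; ++c)
                B[c * n + r] = A[r * n + c];
    } else if (dr >= dc) {                      // split the row range
        co_transpose(A, B, r0, r0 + dr / 2, c0, c1, n);
        co_transpose(A, B, r0 + dr / 2, r1, c0, c1, n);
    } else {                                    // split the column range
        co_transpose(A, B, r0, r1, c0, c0 + dc / 2, n);
        co_transpose(A, B, r0, r1, c0 + dc / 2, c1, n);
    }
}
```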

6. Tiling in Domain-Specific Programming Models and AI Systems

Recent software frameworks, such as TileLang, encapsulate tiled memory access as composable abstractions. Explicit tile operators (copy, gemm, reduce) and scheduling annotations (thread binding, pipelining, layout) are separated, enabling high-level kernels to be mapped efficiently to heterogeneous accelerators. TileLang’s model represents each tile and memory region with formal layouts and systematically infers thread/data mapping, ensuring efficient movement between global, shared, and register memory tiers (Wang et al., 24 Apr 2025).

This leads to sustained state-of-the-art performance on kernels where tiled memory access is dominant, with bandwidth utilization consistently exceeding $90\%$ of hardware limits and kernel runtimes matching or surpassing hand-optimized implementations.
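
The separation between tile operators and scheduling can be mimicked in ordinary C++; the sketch below is a conceptual analogy only and does not use TileLang's actual API.

```cpp
#include <cstddef>
#include <vector>

// Conceptual sketch (not TileLang's API): tile *operators* describe what each
// tile moves and computes; a separate schedule decides tile order, thread
// binding, and pipelining without touching the operator bodies.
struct TileView { const double* data; std::size_t rows, cols, stride; };

// copy: stage a global-memory tile into a fast local buffer.
void copy_tile(TileView src, std::vector<double>& local) {
    local.resize(src.rows * src.cols);
    for (std::size_t r = 0; r < src.rows; ++r)
        for (std::size_t c = 0; c < src.cols; ++c)
            local[r * src.cols + c] = src.data[r * src.stride + c];
}

// gemm: accumulate c += a * b entirely out of staged local buffers.
void gemm_tile(const std::vector<double>& a, const std::vector<double>& b,
               std::vector<double>& c,
               std::size_t m, std::size_t k, std::size_t n) {
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t p = 0; p < k; ++p)
            for (std::size_t j = 0; j < n; ++j)
                c[i * n + j] += a[i * k + p] * b[p * n + j];
}
// A scheduler would iterate tiles and overlap copy_tile with gemm_tile;
// the operator definitions above stay unchanged across targets.
```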

7. Specializations: Fused and Cross-Layer Tiling

Tiled memory access schemes can further be optimized by fusing entire chains of tiled layers. In the context of DNN processing on RISC-V SoCs with software-managed caches, Fused-Tiled Layers (FTL) compute adjacently tiled operators in situ, never materializing intermediates in slow memory. This is formalized via constraint programming: tile sizes are selected such that all buffers for the fused pipeline fit in on-chip SRAM. Empirical results yield a 47.1% reduction in memory transfers and a 60.1% runtime reduction on vision transformer MLP stages (Jung et al., 21 Mar 2025). Algorithmic approaches in cross-layer scheduling for CIM (e.g., CLSA-CIM) exploit similar concepts by globally mapping and issuing tiles as soon as data and compute are available, minimizing idle cycles and achieving up to $29.2\times$ speedup (Pelke et al., 15 Jan 2024).
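
The buffer-fit constraint behind FTL can be illustrated with a brute-force search over tile sizes; the buffer model below (input, intermediate, and output tiles plus two weight slices for a two-layer MLP stage) is an illustrative assumption, not the paper's exact formulation.

```cpp
#include <cstddef>

// Sketch: largest row-tile T for a fused two-layer MLP stage such that every
// live buffer of the fused pipeline fits in on-chip SRAM simultaneously.
std::size_t largest_fused_tile(std::size_t sram_bytes, std::size_t d_in,
                               std::size_t d_hidden, std::size_t d_out,
                               std::size_t elem = 4) {
    for (std::size_t T = 512; T >= 1; --T) {
        const std::size_t bytes =
            elem * (T * d_in            // input tile
                  + T * d_hidden        // fused intermediate, never spilled
                  + T * d_out           // output tile
                  + d_in * d_hidden     // layer-1 weight slice
                  + d_hidden * d_out);  // layer-2 weight slice
        if (bytes <= sram_bytes) return T;  // all fused buffers fit on-chip
    }
    return 0;  // no feasible tile size under this model
}
```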


Tiled memory access is a defining methodology shaping both hardware and software for high-performance, energy-efficient computation. Advanced schemes—including temporal blocking, hybrid tiling, cache- and topology-aware partitioning, sparse/halo-aligned blocks, and composable programming models—continue to set the frontier, closing the gap between raw data movement cost and peak algorithmic throughput across diverse computational domains (Pelke et al., 15 Jan 2024, Malas et al., 2015, Malas et al., 2014, Cashman, 26 Sep 2025, Luporini et al., 2017, Lin et al., 2020, Ranasinghe et al., 2018, Jung et al., 21 Mar 2025, Wang et al., 24 Apr 2025, Jangda et al., 2019).
