Two-Level Blocking Update Framework

Updated 15 January 2026
  • The paper introduces a two-level blocking update framework that partitions computation into coarse and fine blocks, enhancing convergence and data reuse.
  • Methodologies employ structured block-coordinate proximal-gradient, spatial-temporal tiling, and code transformation to optimize performance on multicore and GPU systems.
  • Practical applications in imaging, PDE simulations, and optimization demonstrate significant speedups and reduced memory traffic compared to traditional methods.

The two-level blocking update framework refers to algorithmic and implementation strategies that hierarchically partition computation and data accesses along two distinct axes: typically a spatial dimension and a temporal (or finer spatial) dimension, or a hierarchical decomposition of problem variables. This structure enables improved data locality, parallelism, and convergence properties in a broad class of optimization and scientific computing algorithms. The approach has been applied to large-scale non-smooth, non-convex block-structured optimization, to efficient stencil computations on cache-based multicore and GPU systems, and to multi-level inverse problems in imaging and PDE simulation. The core principle is the introduction of two nested blockings, each exploiting a different aspect of algorithmic or hardware structure, yielding acceleration and resource efficiency over single-level methods.

1. Two-Level Blocking in Block-Coordinate Optimization

In non-smooth, non-convex block optimization, two-level blocking is formalized by defining variables as belonging to fine blocks, which are grouped into coarse blocks. Specifically, for variable space $H = \bigoplus_{\ell=1}^L H_\ell$ with $x = (x_1, \ldots, x_L)$, a partition is constructed:

  • Coarse level (level 1): indices $\{1,\ldots,L\}$ are partitioned into $R$ disjoint subsets $I_i$.
  • Fine level (level 2): each $I_i$ is further split into $L_i$ non-overlapping subsets $J_{i,1},\ldots,J_{i,L_i}$, indexing fine blocks $B=(i,j)$.

The associated update framework, exemplified by the Flexible Block-Coordinate Proximal–Gradient (FLEX-BC-PG) algorithm, enables structured block selection and prioritization, exploiting problem hierarchies (e.g., multi-resolution representations in imaging) to adapt the update sequence for maximal progress at both coarse and fine scales (Briceño-Arias et al., 30 Oct 2025). Selection mechanisms span weighted priorities, correlation rules, and essential cyclicity to guarantee full variable coverage.
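As an illustrative sketch, the two-level partition might be constructed as follows. The round-robin coarse grouping and the fine-block size of 2 are invented for the example; the framework allows arbitrary disjoint partitions:

```python
# Sketch of a two-level index partition: L fine variable blocks grouped
# into R coarse blocks I_i, each I_i split into fine subsets J_{i,j}.
# The round-robin grouping and fine-block size of 2 are illustrative
# choices, not prescribed by the framework.

def two_level_partition(L, R):
    indices = list(range(L))
    coarse = [indices[i::R] for i in range(R)]            # level 1: disjoint I_i
    fine = {i: [I[k:k + 2] for k in range(0, len(I), 2)]  # level 2: J_{i,j}
            for i, I in enumerate(coarse)}
    return coarse, fine

coarse, fine = two_level_partition(L=8, R=2)
# coarse == [[0, 2, 4, 6], [1, 3, 5, 7]]; every index lies in exactly
# one fine block J_{i,j}, so sweeping the fine blocks covers all variables
```

Any selection rule (weighted priority, correlation, essentially cyclic) then draws blocks $(i,j)$ from this structure.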

The update at each iteration consists of computing partial gradients and blockwise forward-backward (proximal-gradient) steps on the chosen $(i,j)$ blocks:

  • Forward: $x_{i,j}^{n+\frac{1}{2}} = x_{i,j}^n - \alpha_{i,j}^n \nabla_{i,j} f(x^n)$
  • Backward: $x_{i,j}^{n+1} = \mathrm{prox}_{\alpha_{i,j}^n g_{i,j}}\left( x_{i,j}^{n+\frac{1}{2}} \right)$

All other blocks remain unmodified in each step.
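A minimal sketch of one such blockwise step, assuming a least-squares smooth term and an $\ell_1$ nonsmooth term (illustrative stand-ins, not the paper's specific objective):

```python
import numpy as np

# Sketch of one blockwise forward-backward step. Here f(x) = 0.5||Ax-b||^2
# and g = lam*||.||_1 (so the prox is soft thresholding); these are
# illustrative choices, the framework allows any block-separable f, g.

def soft_threshold(v, t):
    """prox of t*||.||_1: elementwise soft thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def block_update(x, A, b, block, alpha, lam):
    """Forward-backward step restricted to the variables in `block`."""
    grad_full = A.T @ (A @ x - b)          # gradient of f at x
    x_new = x.copy()
    # forward (gradient) step on the chosen block only
    x_half = x[block] - alpha * grad_full[block]
    # backward (proximal) step on the same block; other blocks untouched
    x_new[block] = soft_threshold(x_half, alpha * lam)
    return x_new

rng = np.random.default_rng(0)
A, b = rng.normal(size=(10, 6)), rng.normal(size=10)
x = block_update(np.zeros(6), A, b, block=[0, 2], alpha=0.05, lam=0.1)
```

Only the selected block's coordinates change, matching the statement that all other blocks remain unmodified.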

2. Two-Level Blocking in Stencil Computation on Multicore Architectures

Two-level blocking in stencil computations encapsulates spatial blocking (e.g., diamond tiling) and temporal blocking (e.g., wavefront scheduling) (Malas et al., 2014, Malas et al., 2015). This combination allows maximal on-chip data reuse and exposes concurrency at several granularities.

  • Spatial blocking: The computation grid is partitioned into diamond-shaped tiles along a "slow" spatial dimension (e.g., $y$ in 3D Cartesian domains). Each diamond encompasses a block of updates arranged to exploit intra-cache reuse.
  • Temporal blocking: Within each spatial diamond, multiple time-steps ("wavefronts") are processed before the intermediate data is evicted from cache, leveraging temporal data locality.
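The interplay of the two levels can be illustrated on a 1D toy stencil. The sketch below uses redundant halo computation in place of the papers' wavefront scheduling, so it is a simplified teaching analogue rather than the actual scheme:

```python
import numpy as np

# Simplified 1D analogue of two-level blocking: the grid is split into
# spatial tiles, and each tile advances b_t time steps while its data is
# hot in cache. A halo of width b_t is recomputed redundantly so that
# dependencies are respected; the cited papers instead use wavefront
# scheduling, which avoids this redundancy.

def jacobi_step(u):
    """One 3-point Jacobi sweep with fixed (Dirichlet) boundaries."""
    v = u.copy()
    v[1:-1] = 0.5 * (u[:-2] + u[2:])
    return v

def naive(u, steps):
    for _ in range(steps):
        u = jacobi_step(u)
    return u

def tiled(u, b_t, tile):
    n = len(u)
    out = u.copy()
    for start in range(0, n, tile):
        lo, hi = max(0, start - b_t), min(n, start + tile + b_t)
        local = u[lo:hi].copy()
        for _ in range(b_t):
            local = jacobi_step(local)   # halo cells absorb the edge error
        # keep only the interior cells, which are exact after b_t steps
        i0 = start - lo
        i1 = i0 + min(tile, n - start)
        out[start:start + (i1 - i0)] = local[i0:i1]
    return out
```

Both schedules produce identical results; the tiled version simply reorders the work so each tile's data stays resident across time steps.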

Groups of cores (thread groups) are assigned to each diamond and execute pipelined wavefront sweeps through the spatial tile. Dependencies within and across diamonds are managed via FIFO task queues and synchronization within each thread group.

Combined, this two-level blocking dramatically reduces the code balance (memory traffic per lattice update), for example from 1,216 bytes/LUP for typical spatial blocking to ~200–250 bytes/LUP, and achieves 2–4× speedup on bandwidth-bound stencil codes on multicore CPUs (Malas et al., 2015, Malas et al., 2014).
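A back-of-envelope model makes the bytes/LUP notion concrete. The stream counts and reuse factor below are invented for illustration, not figures from the cited papers:

```python
# Back-of-envelope code balance (bytes per lattice update, bytes/LUP)
# for a double-precision stencil. Stream counts and the reuse factor are
# illustrative assumptions, not measurements from the cited papers.

BYTES = 8  # double precision

def code_balance(streams_read, streams_written, reuse_factor):
    """Memory traffic per update: bytes moved per sweep divided by how
    many lattice updates each transferred byte serves."""
    return BYTES * (streams_read + streams_written) / reuse_factor

# no temporal blocking: each sweep serves exactly one update per cell
no_blocking = code_balance(streams_read=2, streams_written=1, reuse_factor=1)
# temporal blocking over b_T = 8 steps: one sweep serves 8 updates per cell
blocked = code_balance(streams_read=2, streams_written=1, reuse_factor=8)
```

The ratio of the two values mirrors the roughly fivefold traffic reduction reported for the CPU stencil codes.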

3. Two-Level Blocking on GPUs and Code Transformation

On fine-grained accelerators such as GPUs, two-level blocking is implemented by fusing spatial and temporal blocking within a tile processed by a thread block (Matsumura et al., 2020). The AN5D automated stencil framework exemplifies these strategies:

  • Spatial: Block all but one spatial dimension, streaming the remaining one; each block processes a contiguous spatial subdomain.
  • Temporal: Fuse multiple time-steps ($b_T$) of updates "in register," performing only one load and one store to global memory per cell across all $b_T$ steps.
  • Memory hierarchy: Registers hold sub-planes for all $b_T$ time-steps, while only two shared memory buffers are used (double-buffering), keeping resource pressure independent of $b_T$.

A lightweight roofline performance model guides parameter selection (block size, $b_T$), and code transformation uses polyhedral scheduling to emit CUDA kernels that optimally traverse spatio-temporal tiles. For 2D 5-point stencils on an NVIDIA Tesla V100, $b_T$ up to 10 yields GFLOP/s scaling to near the architectural peak.
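A roofline-style parameter search of this kind can be sketched as follows; all hardware constants are placeholder assumptions, not values from the AN5D paper:

```python
# Minimal roofline-style model for choosing the temporal blocking factor
# b_T: predicted throughput is the minimum of a compute roof and a
# bandwidth roof, where memory traffic per update shrinks as 1/b_T and
# register pressure caps the feasible b_T. All constants are invented
# placeholders, not figures from the AN5D paper.

PEAK_GFLOPS = 7000.0    # compute roof (assumed)
BANDWIDTH_GBS = 900.0   # memory bandwidth (assumed)
FLOPS_PER_LUP = 10.0    # work per lattice update (assumed)
BYTES_PER_LUP = 16.0    # one load + one store of a double, per b_T steps

def predicted_glups(b_t, max_bt=12):
    if b_t > max_bt:                 # model register spill as infeasible
        return 0.0
    compute_roof = PEAK_GFLOPS / FLOPS_PER_LUP
    bandwidth_roof = BANDWIDTH_GBS / (BYTES_PER_LUP / b_t)
    return min(compute_roof, bandwidth_roof)

best_bt = max(range(1, 16), key=predicted_glups)
```

Throughput grows linearly with $b_T$ while bandwidth-bound, then flattens at whichever roof binds first; the search simply picks the largest feasible $b_T$ in that regime.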

4. Convergence and Performance Guarantees

In non-smooth optimization, two-level blocking within FLEX-BC-PG yields strong convergence guarantees under block-Lipschitz, separability, and KŁ property assumptions. Provided every block is updated at least once in any $K$ consecutive iterations and step-sizes satisfy $0 < \alpha_{i,j}^n < 1/\beta_{(i,j)}$, the sequence $\{x^n\}$ has finite length and converges to a critical point of the composite objective:

  • $\sum_{n=0}^\infty \|x^{n+1} - x^n\| < \infty$
  • $x^n \to x^* \in \mathrm{crit}\,\Psi$

The proof leverages sufficient decrease conditions for the composite objective, blockwise subgradient bounds, and application of the Kurdyka–Łojasiewicz inequality.
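The finite-length property can be observed numerically on a toy composite objective (least squares plus $\ell_1$, an illustrative stand-in for the general setting):

```python
import numpy as np

# Numerical check of the finite-length property: proximal-gradient
# iterations with step size alpha < 1/beta (beta = Lipschitz constant of
# the smooth gradient) yield summable step lengths. Problem sizes and the
# l1 regularizer are illustrative assumptions, not from the paper.

rng = np.random.default_rng(1)
A = rng.normal(size=(50, 10))
b = rng.normal(size=50)
beta = np.linalg.norm(A.T @ A, 2)     # Lipschitz constant of grad f
alpha, lam = 0.9 / beta, 0.1          # step size strictly below 1/beta

x = np.zeros(10)
path_length, last_step = 0.0, np.inf
for _ in range(500):
    grad = A.T @ (A @ x - b)                       # forward step
    x_half = x - alpha * grad
    x_new = np.sign(x_half) * np.maximum(np.abs(x_half) - alpha * lam, 0.0)
    last_step = np.linalg.norm(x_new - x)          # ||x^{n+1} - x^n||
    path_length += last_step
    x = x_new
# the partial sums of ||x^{n+1} - x^n|| plateau, consistent with x^n -> x*
```

The accumulated path length stays bounded and the step norms vanish, as the finite-length guarantee predicts.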

For stencil codes, performance models based on code balance and cache occupancy accurately predict the crossover points at which two-level blocking yields maximal benefit, and autotuning frameworks prune the configuration space to guarantee cache-fit and optimal throughput (Malas et al., 2015, Matsumura et al., 2020).

5. Practical Applications and Empirical Results

Optimization and Imaging

  • In wavelet-based deblurring with a two-level transform, two-level blocking exploits the coarse–fine wavelet hierarchy to suppress high-frequency artifacts early, then refines the details. The FLEX-BC-PG hierarchical update (8 coarse, then 2 full updates, alternated) achieves a 10-fold reduction in CPU time to target objective value over full or cyclic updates, and always outperforms random block selection (Briceño-Arias et al., 30 Oct 2025).

Stencil and PDE Computations

  • In THIIM-FDFD electromagnetics, multi-threaded two-level blocking (spatial diamonds + temporal wavefronts + intra-tile parallelism) achieves 3–4× speedup and a fivefold reduction in main memory traffic over standard spatial blocking, unlocking scalable performance up to all available cores (Malas et al., 2015).
  • In variable-coefficient Jacobi and 25-point stencils, memory traffic is reduced by up to 68%, sustaining 2–3× throughput improvement on mainstream multicore CPUs (Malas et al., 2014).

GPU Platforms

  • For regular 2D and higher-order stencils on V100 GPUs, two-level blocking via AN5D scales to $b_T \sim 10$–$15$ temporal steps, achieving and sustaining near-peak throughput. Low-level optimizations such as fixed register allocation and minimal double-buffered shared memory are essential at scale (Matsumura et al., 2020).

6. Computational and Memory Model

A key unifying principle is optimizing the memory hierarchy at each blocking level:

| Setting | Blocking levels | Memory traffic (bytes/LUP) | Speedup vs. baseline |
|---|---|---|---|
| CPU stencil (Malas et al., 2015) | Spatial + temporal | ~250 (mWD) | 3–4× |
| GPU stencil (Matsumura et al., 2020) | Spatial + temporal | One load/store per cell per $b_T$ steps | ~2× and up to near peak |
| FLEX-BC-PG (Briceño-Arias et al., 30 Oct 2025) | Coarse + fine (variables) | n/a | 10× CPU-time reduction (imaging) |

In all cases, the optimal parameter region depends on the cache or register file size, associated occupancy, and the particular structure of the compute kernel.

7. Impact and Generalizations

The two-level blocking update framework generalizes standard block or spatial blocking by exploiting intrinsic hierarchies—whether in variable structure, problem geometry, or hardware architecture—for substantial gains in convergence speed and computational efficiency. These methods admit theoretically justified convergence and performance guarantees, and are tractable to implement in practice via autotuning and automated code generation frameworks. This approach is expected to remain central in high-dimensional optimization, scientific simulation, and large-scale data-driven methods on multi-level memory and parallel architectures (Briceño-Arias et al., 30 Oct 2025, Malas et al., 2015, Malas et al., 2014, Matsumura et al., 2020).
