Papers
Topics
Authors
Recent
Search
2000 character limit reached

Blocked Matrix Multiply (DPAS)

Updated 4 June 2026
  • Blocked Matrix Multiply (DPAS) is an algorithm that partitions dense matrices into b×b blocks to enhance cache locality and improve GEMM efficiency.
  • The method reorders computations via various traversal schemes (k-outer, row-major, Morton) to maximize temporal reuse and minimize memory bandwidth constraints.
  • Empirical results show that selecting block sizes between 16 and 64 can yield 2–4× speedups over naive multiplication, demonstrating its effectiveness across platforms.

Blocked matrix multiply (also known as DPAS-style blocked matrix–matrix multiplication) is an algorithmic and architectural strategy for computing the product of two dense N×NN \times N matrices via partitioning them into b×bb \times b submatrices (blocks) and orchestrating both storage and traversal to optimize temporal and spatial locality. The method leverages the reuse of small matrix tiles in cache to enhance computational throughput and to minimize main-memory bandwidth limitations during general matrix–matrix multiplication (GEMM) routines. Blocked multiply strategies are central to high-performance dense linear algebra libraries, and cache-oblivious heuristics are vital for generalizing these gains across architectures without relying on low-level memory layout choices (0808.1108).

1. Blocked Matrix Multiply Formalism and Index Partitioning

Let AA, BB, and CC be dense N×NN \times N matrices to be multiplied as C=ABC = AB. Select a block size bb such that n=N/bn = N / b is integral; otherwise, matrices are padded to the next multiple of bb. Define submatrices as

b×bb \times b0

and analogously for b×bb \times b1 and b×bb \times b2 for b×bb \times b3. The blocked multiply is then given by

b×bb \times b4

where each submatrix operation reduces to a call to GEMM on b×bb \times b5 blocks.

2. Algorithmic Structure and Traversal Orders

The essence of blocked multiply is captured by iterating over all index triples b×bb \times b6 in a prescribed order:

  • k–outer loop:

n=N/bn = N / b4

  • i–j–k (row-major output order):

n=N/bn = N / b5

  • Morton/Peano recursion: Visit b×bb \times b7 in Z-curve or space-filling curve order, recursively descending on b×bb \times b8 block quadrants in b×bb \times b9-space, thereby maximizing temporal and spatial locality by minimizing working set size.

The particular traversal order (the "ORDER") of AA0 indices—rather than memory layout of blocks—is the determining factor for performance improvement when block sizes are sufficiently large to exploit cache locality (0808.1108).

3. Storage Layouts for Blocked Submatrices

Four canonical strategies exist for arranging the AA1 blocks of matrices AA2, AA3, and AA4 in physical memory (element size AA5 bytes):

Strategy Block Storage In-block Order
Contiguous row-major Linear in AA6 row-by-row
Contiguous column-major Linear in AA7 column-by-column
Interleaved row-stripe Rows of blocks contiguous (with interleaving for associativity) arbitrary
Morton/Peano (Z order) Blocks placed by interleaved AA8 bits arbitrary
  • Row-major: Block AA9 at BB0, element BB1 within block at BB2.
  • Column-major: Same scheme with BB3 reversed.
  • Interleaved row-stripe: Blocks BB4 are contiguous, with adjustable stride for cache associativity.
  • Morton/Z-order: Block BB5 at BB6; Morton index BB7 calculated by bit-interleaving digits of BB8 and BB9.

Within each block, use whichever layout (row-major or column-major) is optimal for the backend GEMM implementation. In practice, results reveal that the storage layout of blocks becomes irrelevant for performance as block size increases; execution order is the controlling parameter (0808.1108).

4. Execution Order Heuristics and Cache Model

Execution-order heuristics critically impact cache usage and data reuse:

  • Perfect-locality test: Repeats CC0, keeping all blocks resident in cache.
  • No-locality test: Randomizes all CC1 tuples and block allocations, maximally destroying temporal and spatial reuse.
  • k-ordered: CC2 outermost, CC3 innermost; maximizes temporal locality for CC4 and CC5 across multiple CC6 updates.
  • i–j–k (row-major): Processes all CC7 for every CC8 output block before moving on.
  • Morton/Peano recursion: Maintains a small working set by recursive subdivision of blocks.

The cache model is characterized as follows:

  • Block size CC9 with element size N×NN \times N0 implies each block is N×NN \times N1 bytes.
  • If N×NN \times N2 (N×NN \times N3 = cache size), blocks persist in cache across inner loops, ensuring temporal reuse.
  • Modern prefetchers provide sufficient spatial reuse within each block when N×NN \times N4 is several cache lines.
  • A block should not be evicted before completion of all necessary updates, otherwise temporal locality is lost.
  • Cache misses per block-multiply are predicted as
    • N×NN \times N5 for cold blocks,
    • N×NN \times N6 if two blocks remain cached from prior iteration.

5. Quantitative Performance and Scaling

Performance measurements on AMD Opteron 248 (2.2 GHz, 1 MiB L2) and Xeon Woodcrest (2.66 GHz, 4 MiB L2) use

N×NN \times N7

where N×NN \times N8 is CPU time. Speedup relative to "no locality" baseline is

N×NN \times N9

C=ABC = AB0 no-loc temp-only full Peano perfect-loc
8 1.0× 2.8× 2.9× 3.0×
16 1.0× 2.5× 2.6× 2.7×
32 1.0× 1.9× 2.0× 2.1×
64 1.0× 1.3× 1.4× 1.4×
128 1.0× 1.1× 1.1× 1.1×

On both platforms, speedup saturates for C=ABC = AB1, dropping to C=ABC = AB2; for C=ABC = AB3, speedup reaches C=ABC = AB4–C=ABC = AB5 over the "no locality" order (0808.1108).

6. Implementation Guidelines and Practical Recommendations

Key findings and prescriptions for DPAS-style implementation:

  • Temporal locality is dominant: Arrange C=ABC = AB6 traversal so that all block updates for a given C=ABC = AB7, C=ABC = AB8, C=ABC = AB9 triplet are completed before potential cache eviction.
  • Spatial locality (contiguous memory for blocks) is only consequential for prefetch efficiency within each GEMM call; block-to-block alignment is immaterial for bb0 with bb1 B lines.
  • Morton/Peano traversal realizes nearly ideal locality, but any schedule that processes all bb2 for each bb3 suffices.
  • Block size tuning:
    • Require bb4.
    • Avoid bb5 so small that GEMM vectorization degrades (bb6 typically bb7–bb8).
  • Blocks may be stored in any array structure (row-major is simplest).
  • Loop schedule: n=N/bn = N / b6 or use a small-recursion Z-curve that tiles the bb9 cube in n=N/bn = N / b0 block-octants.

By adhering to these guidelines, one achieves the n=N/bn = N / b1–n=N/bn = N / b2 speedups for small n=N/bn = N / b3 as empirically demonstrated, wholly independent of whether blocks are stored in specialized Morton or other layouts (0808.1108).

7. Conclusions and Implications

Blocked matrix multiply strategies prioritize temporal locality through thoughtful traversal of block operation triples, rather than through intricate block memory layouts. The major implication is that algorithmic scheduling affords near-optimal performance across platforms without the complexity of cache-oblivious data structures or elaborate physical reorganization. The performance benefits plateau with growing block size as spatial locality between blocks becomes negligible, and as BLAS-generic GEMM tuning dominates achievable throughput. Thus, future matrix–matrix multiplication designs should focus on execution order heuristics and cache residency constraints rather than specialized block placement, consistent with the empirical and analytic findings presented in the cited study (0808.1108).

Definition Search Book Streamline Icon: https://streamlinehq.com
References (1)

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to Blocked Matrix Multiply (DPAS).