Blocked Matrix Multiply (DPAS)
- Blocked Matrix Multiply (DPAS) is an algorithm that partitions dense matrices into b×b blocks to enhance cache locality and improve GEMM efficiency.
- The method reorders computations via various traversal schemes (k-outer, row-major, Morton) to maximize temporal reuse and minimize memory bandwidth constraints.
- Empirical results show that selecting block sizes between 16 and 64 can yield 2–4× speedups over naive multiplication, demonstrating its effectiveness across platforms.
Blocked matrix multiply (also known as DPAS-style blocked matrix–matrix multiplication) is an algorithmic and architectural strategy for computing the product of two dense matrices via partitioning them into submatrices (blocks) and orchestrating both storage and traversal to optimize temporal and spatial locality. The method leverages the reuse of small matrix tiles in cache to enhance computational throughput and to minimize main-memory bandwidth limitations during general matrix–matrix multiplication (GEMM) routines. Blocked multiply strategies are central to high-performance dense linear algebra libraries, and cache-oblivious heuristics are vital for generalizing these gains across architectures without relying on low-level memory layout choices (0808.1108).
1. Blocked Matrix Multiply Formalism and Index Partitioning
Let , , and be dense matrices to be multiplied as . Select a block size such that is integral; otherwise, matrices are padded to the next multiple of . Define submatrices as
0
and analogously for 1 and 2 for 3. The blocked multiply is then given by
4
where each submatrix operation reduces to a call to GEMM on 5 blocks.
2. Algorithmic Structure and Traversal Orders
The essence of blocked multiply is captured by iterating over all index triples 6 in a prescribed order:
- k–outer loop:
4
- i–j–k (row-major output order):
5
- Morton/Peano recursion: Visit 7 in Z-curve or space-filling curve order, recursively descending on 8 block quadrants in 9-space, thereby maximizing temporal and spatial locality by minimizing working set size.
The particular traversal order (the "ORDER") of 0 indices—rather than memory layout of blocks—is the determining factor for performance improvement when block sizes are sufficiently large to exploit cache locality (0808.1108).
3. Storage Layouts for Blocked Submatrices
Four canonical strategies exist for arranging the 1 blocks of matrices 2, 3, and 4 in physical memory (element size 5 bytes):
| Strategy | Block Storage | In-block Order |
|---|---|---|
| Contiguous row-major | Linear in 6 | row-by-row |
| Contiguous column-major | Linear in 7 | column-by-column |
| Interleaved row-stripe | Rows of blocks contiguous (with interleaving for associativity) | arbitrary |
| Morton/Peano (Z order) | Blocks placed by interleaved 8 bits | arbitrary |
- Row-major: Block 9 at 0, element 1 within block at 2.
- Column-major: Same scheme with 3 reversed.
- Interleaved row-stripe: Blocks 4 are contiguous, with adjustable stride for cache associativity.
- Morton/Z-order: Block 5 at 6; Morton index 7 calculated by bit-interleaving digits of 8 and 9.
Within each block, use whichever layout (row-major or column-major) is optimal for the backend GEMM implementation. In practice, results reveal that the storage layout of blocks becomes irrelevant for performance as block size increases; execution order is the controlling parameter (0808.1108).
4. Execution Order Heuristics and Cache Model
Execution-order heuristics critically impact cache usage and data reuse:
- Perfect-locality test: Repeats 0, keeping all blocks resident in cache.
- No-locality test: Randomizes all 1 tuples and block allocations, maximally destroying temporal and spatial reuse.
- k-ordered: 2 outermost, 3 innermost; maximizes temporal locality for 4 and 5 across multiple 6 updates.
- i–j–k (row-major): Processes all 7 for every 8 output block before moving on.
- Morton/Peano recursion: Maintains a small working set by recursive subdivision of blocks.
The cache model is characterized as follows:
- Block size 9 with element size 0 implies each block is 1 bytes.
- If 2 (3 = cache size), blocks persist in cache across inner loops, ensuring temporal reuse.
- Modern prefetchers provide sufficient spatial reuse within each block when 4 is several cache lines.
- A block should not be evicted before completion of all necessary updates, otherwise temporal locality is lost.
- Cache misses per block-multiply are predicted as
- 5 for cold blocks,
- 6 if two blocks remain cached from prior iteration.
5. Quantitative Performance and Scaling
Performance measurements on AMD Opteron 248 (2.2 GHz, 1 MiB L2) and Xeon Woodcrest (2.66 GHz, 4 MiB L2) use
7
where 8 is CPU time. Speedup relative to "no locality" baseline is
9
| 0 | no-loc | temp-only | full Peano | perfect-loc |
|---|---|---|---|---|
| 8 | 1.0× | 2.8× | 2.9× | 3.0× |
| 16 | 1.0× | 2.5× | 2.6× | 2.7× |
| 32 | 1.0× | 1.9× | 2.0× | 2.1× |
| 64 | 1.0× | 1.3× | 1.4× | 1.4× |
| 128 | 1.0× | 1.1× | 1.1× | 1.1× |
On both platforms, speedup saturates for 1, dropping to 2; for 3, speedup reaches 4–5 over the "no locality" order (0808.1108).
6. Implementation Guidelines and Practical Recommendations
Key findings and prescriptions for DPAS-style implementation:
- Temporal locality is dominant: Arrange 6 traversal so that all block updates for a given 7, 8, 9 triplet are completed before potential cache eviction.
- Spatial locality (contiguous memory for blocks) is only consequential for prefetch efficiency within each GEMM call; block-to-block alignment is immaterial for 0 with 1 B lines.
- Morton/Peano traversal realizes nearly ideal locality, but any schedule that processes all 2 for each 3 suffices.
- Block size tuning:
- Require 4.
- Avoid 5 so small that GEMM vectorization degrades (6 typically 7–8).
- Blocks may be stored in any array structure (row-major is simplest).
- Loop schedule: 6 or use a small-recursion Z-curve that tiles the 9 cube in 0 block-octants.
By adhering to these guidelines, one achieves the 1–2 speedups for small 3 as empirically demonstrated, wholly independent of whether blocks are stored in specialized Morton or other layouts (0808.1108).
7. Conclusions and Implications
Blocked matrix multiply strategies prioritize temporal locality through thoughtful traversal of block operation triples, rather than through intricate block memory layouts. The major implication is that algorithmic scheduling affords near-optimal performance across platforms without the complexity of cache-oblivious data structures or elaborate physical reorganization. The performance benefits plateau with growing block size as spatial locality between blocks becomes negligible, and as BLAS-generic GEMM tuning dominates achievable throughput. Thus, future matrix–matrix multiplication designs should focus on execution order heuristics and cache residency constraints rather than specialized block placement, consistent with the empirical and analytic findings presented in the cited study (0808.1108).