
Cache-miss explanation for PR speedups

Establish whether processing one supernode at a time with a reduced-space data structure in the partition refinement (PR) reordering implementation reduces cache misses and thereby explains the large observed reductions in runtime relative to the original PR implementation of Jacquelin, Ng, and Peyton (2018).


Background

The paper introduces a reorganized implementation of the partition refinement (PR) method that processes and completely refines one supernode at a time, replacing an earlier implementation that refined partitions for all supernodes simultaneously.

This new implementation drastically reduces working storage (from many n-vectors to one n-vector plus a small number of N- and m'-vectors) and yields large runtime reductions in practice. The authors conjecture that fewer cache misses due to the smaller, more localized data structure are the likely cause of these runtime gains, but this causal explanation has not been established.
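As a rough illustration of the locality argument (not the authors' actual code), a generic partition-refinement loop that completely refines one supernode before moving to the next can reuse a single length-n workspace across supernodes, instead of keeping refinement state for all supernodes live at once. All names below (`refine_supernode`, `splitters`, `pos`) are hypothetical:

```python
def refine_supernode(cols, splitters, pos):
    """Completely refine the column ordering of one supernode.

    cols: column indices belonging to this supernode.
    splitters: iterable of sets; each set splits every current block
        into (members in the set) followed by (members not in it).
    pos: a single reusable length-n workspace mapping column -> slot;
        only the entries for `cols` are touched here, so successive
        supernodes can reuse the same array -- the kind of small,
        localized footprint the cache-miss conjecture points to.
    """
    blocks = [list(cols)]
    for s in splitters:
        new_blocks = []
        for b in blocks:
            hit = [c for c in b if c in s]
            miss = [c for c in b if c not in s]
            for part in (hit, miss):
                if part:
                    new_blocks.append(part)
        blocks = new_blocks
    order = [c for b in blocks for c in b]
    for slot, c in enumerate(order):
        pos[c] = slot
    return order


# One shared workspace, reused for every supernode in turn.
n = 8
pos = [0] * n
order = refine_supernode([0, 1, 2, 3], [{1, 3}, {3}], pos)
```

In this sketch the per-supernode state (`blocks`) is proportional to the supernode size, and `pos` is the lone n-vector; a simultaneous-refinement variant would instead hold comparable state for every supernode at once.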

References

We conjecture that working on one supernode at a time, using a data structure that occupies much less space, greatly reduces the number of cache misses during the computation. We conjecture that this probably explains the large reductions in runtimes over those obtained using the original implementation in Jacquelin, Ng, and Peyton (2018).

A comparison of two effective methods for reordering columns within supernodes (2501.08395 - Karsavuran et al., 14 Jan 2025) in Section 4.3 (Two improvements to the PR method)